Ceph writes stall for long periods with no disk/network activity

On Wed, 6 Aug 2014 09:19:57 -0400 Chris Kitzmiller wrote:

> On Aug 5, 2014, at 12:43 PM, Mark Nelson wrote:
> > On 08/05/2014 08:42 AM, Mariusz Gronczewski wrote:
> >> On Mon, 04 Aug 2014 15:32:50 -0500, Mark Nelson
> >> <mark.nelson at inktank.com> wrote:
> >>> On 08/04/2014 03:28 PM, Chris Kitzmiller wrote:
> >>>> On Aug 1, 2014, at 1:31 PM, Mariusz Gronczewski wrote:
> >>>>> I'm getting weird stalling during writes; sometimes I get the same
> >>>>> write speed for a few minutes, and after some time it starts stalling
> >>>>> at 0 MB/s for minutes.
> >>>> 
> >>>> I'm getting very similar behavior on my cluster. My writes start
> >>>> well but then just kinda stop for a while and then bump along
> >>>> slowly until the bench finishes. I've got a thread about it going
> >>>> here called "Ceph runs great then falters".
> >>> 
> >>> This kind of behaviour often results when the journal can write much
> >>> faster than the OSD data disks.  Initially the journals will be able
> >>> to absorb the data and things will run along well, but eventually
> >>> ceph will need to stall writes if things get too out of sync.  You
> >>> may want to take a look at what's happening on the data disks during
> >>> your tests to see if there's anything that looks suspect.  Checking
> >>> the admin socket for dump_historic_ops might provide some clues as
> >>> well.
> >>> 
> >>> Mark
> >> 
> >> I did check the journals already; they are on the same disk as the data
> >> (separate partition), and during stalls there is no traffic to either of
> >> them (around 8 IOPS on average with 0% iowait).
> > 
> > This may indicate that 1 OSD could be backing up with possibly most if
> > not all IOs waiting on it.  The idea here is that because data
> > placement is deterministic, if 1 OSD is slow, over time just by random
> > chance all outstanding client operations will back up on it.  Having
> > more concurrency gives you more wiggle room but may not ultimately
> > solve it.
> > 
> > It's also possible that something else may be causing the OSDs to
> > wait.  dump_historic_ops might help.
> 
> This turns out to have been my problem. Monitoring my cluster with atop
> (thanks, Christian Balzer) during one of these incidents found that a
> single HDD (out of 90) was pegged to 100% utilization. I replaced the
> drive and have since written over 20TB of data to my RBD device without
> issue.
> 
No worries, I'm happy that it helped and that the disk turned out to be the
most likely suspect.
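
For reference, the dump_historic_ops check Mark mentions above goes through
the admin socket on the node hosting that OSD; osd.12 and the socket path
below are just examples, adjust them for your cluster:
---
# show the slowest recent operations of one OSD (run on its host)
ceph daemon osd.12 dump_historic_ops

# or go through the admin socket file directly
ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops
---
The output has per-stage timestamps for each op, which usually points at
where the time is actually being spent.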

Now your disks don't have these SMART parameters that my equivalent
Toshiba ones have:
---
# smartctl -a /dev/sdg |grep Perfor
  2 Throughput_Performance  0x0005   139   139   054    Pre-fail  Offline      -       72
  8 Seek_Time_Performance   0x0005   117   117   020    Pre-fail  Offline      -       36
---

And I wouldn't trust them entirely, as in base my judgment of a disk just on
those, but they are a good starting point for seeing whether a disk is
probably underperforming or not.
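
Something like this (the device glob is just an example) gives you a quick
per-disk overview of those attributes across a node:
---
#!/bin/bash
# Print the performance-related SMART attributes of every data disk.
# Adjust the device glob to match your nodes.
for d in /dev/sd[a-l]; do
    echo "== $d"
    smartctl -a "$d" | grep -E 'Throughput_Performance|Seek_Time_Performance'
done
---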

With the previous generation of Seagates I had disks that showed no signs
of trouble in the available SMART parameters, but when tested they were
performing at as little as 60% of the speed of a "healthy" one.

You might want to cobble together a script that (when your cluster is idle,
at a steady state, or offline) tests the speed of each and every disk.
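
A minimal sketch of such a script, assuming sdb to sdk are your data disks
and the cluster really is quiet (a plain sequential read is enough to make a
slow disk stand out):
---
#!/bin/bash
# Sequential read test of every data disk, bypassing the page cache.
# Only run this while the cluster is idle or the OSDs are stopped.
for d in /dev/sd[b-k]; do
    echo -n "$d: "
    dd if="$d" of=/dev/null bs=1M count=2048 iflag=direct 2>&1 | tail -n1
done
---
Reading from the raw device doesn't touch the data, it just costs you IOPS,
which is why you want the cluster quiet while doing it.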

> I'm not sure I fully understand what's going on when this happens but it
> is pretty clear that it isn't happening any more. 

> It would be great to
> have some sort of warning to say that the load on a single disk is
> disproportionate to the rest of the cluster.

While I agree, doing that in a "sensible" way might be quite hard.
The high load on a single OSD is likely to be caused by some problem
(disk, link, controller, etc.), but it could also be just bad luck at the
poker table that is CRUSH.
As in, by pure chance your most I/O intensive VMs or RGW or whatever are
hitting the same PG(s), creating a hot spot. If all the action happens
within a single 4MB Ceph object, that's not THAT unlikely either.
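
Until somebody builds that warning, a crude approximation is to keep an eye
on the per-OSD latencies and look for outliers, for example:
---
# fs_commit/fs_apply latency per OSD; one OSD sitting consistently far
# above its peers is worth a closer look (disk, link, controller, ...)
ceph osd perf | sort -n -k2 | tail -n 5
---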

If you scour the archives of this ML you'll find some people graphing each
and every OSD, node and more. 
Tedious, but at 90 HDDs or more probably a very good idea.

[snip]
> >> 
> >> I've checked for network or IO load on every node and they are just
> >> not doing anything, no kernel errors, and those nodes worked fine
> >> under load when they were us
> > 
> > I'm guessing the issue is probably going to be more subtle than that
> > unfortunately.  At least based on prior issues, it seems like often
> > something is causing latency in some part of the system and when that
> > happens it can have very far-reaching effects.
> 
> 
> I've often wished for some sort of bottleneck finder for ceph. An easy
> way for the system to say where it is experiencing critical latencies
> e.g. network, journals, osd data disks, etc. This would assist
> troubleshooting and initial deployments immensely.

As mentioned above, it's tricky. 
Most certainly desirable, but the ole Mark I eyeball and wetware is quite
good at spotting these when presented with appropriate input like atop.
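
And if atop isn't to your taste, even plain iostat on each node (5 second
interval here, pick what you like) makes a pegged disk jump out quickly:
---
# extended per-device statistics every 5 seconds; watch for a disk that
# sits near 100 %util or has a much higher await than its neighbours
iostat -xmt 5
---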


Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

