Re: Lot of blocked operations

Hello,

On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:

> Hi,
> 
> sorry for the missing information. I was trying to avoid including
> too much irrelevant info ;)
> 
Nah, everything helps, there are known problems with some versions,
kernels, file systems, etc.

Speaking of which, what FS are you using on your OSDs?

> 
> 
> On Friday, 18 September 2015 at 12:30 +0900, Christian Balzer wrote:
> > Hello,
> > 
> > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > 
> > The items below help, but be as specific as possible, from OS and
> > kernel version to Ceph version, "ceph -s", and any other specific
> > details (pool type, replica size).
> > 
> 
> So, all nodes use Debian Wheezy, running on a vanilla 3.14.x kernel,
> and Ceph 0.80.10.
All my stuff is on Jessie, but at least Firefly should be stable and I
haven't seen anything like your problem with it.
And while 3.14 is an LTS kernel, I wonder if something newer might be
beneficial, but probably not.

> I don't have a ceph status at hand right now. But I have data to
> move again tonight, so I'll capture one then.
>
I was interested in that to see how many pools and PGs you have.
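
If you want to pull those numbers without waiting for the next
reshuffle, something like this should work on Firefly:

  ceph osd dump | grep '^pool'   # one line per pool: size, pg_num, etc.
  ceph pg stat                   # one-line PG and data summary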
 
> The affected pool is a standard one (no erasure coding), with only 2
> replicas (size=2).
> 
Good, nothing fancy going on there then.

> 
> 
> 
> > > Some additional information:
> > > - I have 4 SSDs per node.
> > Type, if nothing else for anecdotal reasons.
> 
> I have 7 storage nodes here:
> - 3 nodes with 12 OSDs each (300GB SSDs)
> - 4 nodes with 4 OSDs each (800GB SSDs)
> 
> And I'm trying to replace the 12x300GB nodes with the 4x800GB nodes.
> 
Type as in maker/model, but that's helpful information.

> 
> 
> > > - the CPU usage is near 0
> > > - IO wait is near 0 too
> > Including the trouble OSD(s)?
> 
> Yes
> 
> 
> > Measured how, iostat or atop?
> 
> iostat, htop, and confirmed with our Zabbix monitoring.
>

Good. I'm sure you checked for network errors. 
Single network or split client/cluster network?
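
If you haven't already, the per-interface error and drop counters are
quick to inspect on every node; a minimal check (eth0 is just a
placeholder for your actual interface):

  ip -s link show eth0                         # RX/TX errors and drops
  ethtool -S eth0 | grep -Ei 'err|drop|fifo'   # NIC counters, driver-dependent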

> 
> 
> 
> > > - bandwidth usage is also near 0
> > > 
> > Yeah, all of the above are not surprising if everything is stuck
> > waiting on some ops to finish.
> > 
> > How many nodes are we talking about?
> 
> 
> 7 nodes, 52 OSDs.
> 
That should be below the threshold for most system tunables (there are
various threads and articles on how to tune Ceph for "large" clusters).

Since this happens only when your cluster reshuffles data (and thus has
more threads going), what is your ulimit setting for open files?
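
A quick sanity check against the limits the running OSDs actually got
(a sketch, assuming the standard ceph-osd process name):

  for pid in $(pidof ceph-osd); do
      grep 'Max open files' /proc/$pid/limits
  done
  sysctl fs.file-max    # system-wide ceiling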

> 
> 
> > > The whole cluster seems to be waiting for something... but I don't
> > > see what.
> > > 
> > Is it just one specific OSD (or a set of them) or is that all over
> > the place?
> 
> A set of them. When I increase the weight of all 4 OSDs of a node, I
> frequently get blocked IO from one OSD of that node.
> 
The plot thickens, as in, the target of most writes (new PGs being moved
there) is the culprit.
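
For the record, the health output should name the OSDs the slow
requests are sitting on, which makes correlating them with the
reweighted node easy:

  ceph health detail | grep -Ei 'slow|blocked'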

> 
> 
> > Does restarting the OSD fix things?
> 
> Yes. For several minutes.
> 
That also points to resource starvation of some sort; I'd investigate
along those lines.
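
The next time one blocks, its admin socket should tell you what the
stuck ops are actually waiting on; a sketch, with osd.0 standing in for
the affected OSD:

  # in-flight ops, with per-op age and flag_point
  ceph daemon osd.0 dump_ops_in_flight
  # recently completed slow ops, via the socket directly
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

The per-op events usually narrow down whether it is stuck waiting on
sub-ops, the journal, or the PG itself.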

Christian
> 
> > Christian
> > > 
> > > On Friday, 18 September 2015 at 02:35 +0200, Olivier Bonvalet
> > > wrote:
> > > > Hi,
> > > > 
> > > > I have a cluster with a lot of blocked operations each time I
> > > > try to move data (by slightly reweighting an OSD).
> > > > 
> > > > It's a full-SSD cluster with a 10GbE network.
> > > > 
> > > > In the logs, when I have a blocked OSD, I can see this on the
> > > > primary OSD:
> > > > 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > > 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request
> > > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > osd_op(client.29760717.1:18680817544
> > > > rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4
> > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > currently reached pg
> > > > 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > > 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request
> > > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > osd_op(client.29760717.1:18680817544
> > > > rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4
> > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > currently reached pg
> > > > 
> > > > How should I read that? What is this OSD waiting for?
> > > > 
> > > > Thanks for any help,
> > > > 
> > > > Olivier


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



