Hello,

On Fri, 18 Sep 2015 10:35:37 +0200 Olivier Bonvalet wrote:

> On Friday 18 September 2015 at 17:04 +0900, Christian Balzer wrote:
> > Hello,
> >
> > On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:
> >
> > > Hi,
> > >
> > > sorry for the missing information. I was trying to avoid putting in
> > > too much inappropriate info ;)
> > >
> > Nah, everything helps, there are known problems with some versions,
> > kernels, file systems, etc.
> >
> > Speaking of which, what FS are you using on your OSDs?
> >
> XFS.
>
No surprises there, one hopes.

> > > On Friday 18 September 2015 at 12:30 +0900, Christian Balzer wrote:
> > > > Hello,
> > > >
> > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > >
> > > > The items below help, but be as specific as possible, from OS and
> > > > kernel version to Ceph version, "ceph -s", and any other specific
> > > > details (pool type, replica size).
> > > >
> > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x kernel,
> > > and Ceph 0.80.10.
> >
> > All my stuff is on Jessie, but at least Firefly should be stable and I
> > haven't seen anything like your problem with it.
> > And while 3.14 is an LTS kernel, I wonder if something newer might be
> > beneficial, but probably not.
> >
> Well, I can try a 3.18.x kernel. But for that I have to restart all
> nodes, which will trigger some backfilling and probably some blocked IO
> too ;)
>
Yeah, as I said, it might be helpful in some other ways, but probably not
related to your problems.

> > > I don't have any more ceph status output right now. But I have data
> > > to move tonight again, so I'll track that.
> > >
> > I was interested in that to see how many pools and PGs you have.
>
> Well:
>
>     cluster de035250-323d-4cf6-8c4b-cf0faf6296b1
>      health HEALTH_OK
>      monmap e21: 3 mons at {faude=10.0.0.13:6789/0,murmillia=10.0.0.18:6789/0,rurkh=10.0.0.19:6789/0},
>             election epoch 4312, quorum 0,1,2 faude,murmillia,rurkh
>      osdmap e847496: 88 osds: 88 up, 87 in
>       pgmap v86390609: 6632 pgs, 16 pools, 18883 GB data, 5266 kobjects
>             68559 GB used, 59023 GB / 124 TB avail
>                 6632 active+clean
>   client io 3194 kB/s rd, 23542 kB/s wr, 1450 op/s
>
> There are mainly 2 pools in use: an "ssd" pool and an "hdd" pool. The
> hdd pool uses different OSDs, on different nodes.
> Since I don't often rebalance data in the hdd pool, I haven't seen the
> problem on it yet.
>
How many PGs in the SSD pool? I can see this easily exceeding your open
file limits.
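If you want to put actual numbers to that while tonight's data move runs,
something along these lines (run as root on one of the SSD nodes; the "ssd"
pool name is taken from your description above) should show both the PG
count and how close each ceph-osd process is to its descriptor limit:
---
# PGs in the ssd pool (the pool lines of "ceph osd dump" show this as well)
ceph osd pool get ssd pg_num

# open descriptors vs. limit for every ceph-osd process on this node
for pid in $(pidof ceph-osd); do
    echo "ceph-osd pid $pid: $(ls /proc/$pid/fd | wc -l) fds in use"
    grep 'Max open files' /proc/$pid/limits
done
---
Between object files, leveldb and messenger connections to every peer and
client, a busy OSD can chew through a 1024 soft limit rather quickly during
a reshuffle.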
> > > The affected pool is a standard one (no erasure coding), with only
> > > 2 replicas (size=2).
> > >
> > Good, nothing fancy going on there then.
> >
> > > > > Some additional information:
> > > > > - I have 4 SSDs per node.
> > > > Type, if nothing else for anecdotal reasons.
> > >
> > > I have 7 storage nodes here:
> > > - 3 nodes which each have 12 OSDs on 300GB SSDs
> > > - 4 nodes which each have 4 OSDs on 800GB SSDs
> > >
> > > And I'm trying to replace the 12x300GB nodes with the 4x800GB nodes.
> > >
> > Type as in model/maker, but helpful information.
> >
> 300GB models are Intel SSDSC2BB300G4 (DC S3500).
> 800GB models are Intel SSDSC2BB800H4 (DC S3500 I think).
>
0.3 DWPD, but I guess you know that.

> > > > > - the CPU usage is near 0
> > > > > - IO wait is near 0 too
> > > > Including the trouble OSD(s)?
> > >
> > > Yes
> > >
> > > > Measured how, iostat or atop?
> > >
> > > iostat, htop, and confirmed with Zabbix supervisor.
> > >
> > Good.
> > I'm sure you checked for network errors.
> > Single network or split client/cluster network?
> >
> It's the first thing I checked, and latency and packet loss are
> monitored between each node and the mons, but maybe I forgot some
> checks.
>
> > > > > - bandwidth usage is also near 0
> > > > >
> > > > Yeah, all of the above are not surprising if everything is stuck
> > > > waiting on some ops to finish.
> > > >
> > > > How many nodes are we talking about?
> > > >
> > > 7 nodes, 52 OSDs.
> > >
> > That'd be below the threshold for most system tunables (there are
> > various threads and articles on how to tune Ceph for "large" clusters).
> >
> > Since this happens only when your cluster reshuffles data (and thus
> > has more threads going), what is your ulimit setting for open files?
> >
> Wow... the default one on Debian Wheezy: 1024.
>
You want to fix this, both during startup and in general.
I use this for sysv-init systems like Wheezy:
---
# cat /etc/initscript
#
ulimit -Hn 65536
ulimit -Sn 16384

# Execute the program.
eval exec "$4"
---
and:
---
# cat /etc/security/limits.d/tuning.conf
root    soft    nofile  16384
root    hard    nofile  65536
*       soft    nofile  16384
*       hard    nofile  65536
---
Adjust as needed.
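Also, when you catch one of those blocked OSDs in the act, its admin socket
can tell you more than the one-line slow request warnings do. Assuming the
default socket path, something like this (replace <id> with the OSD number)
dumps the ops currently in flight and the slowest recent ones, along with
the stage each one is sitting at:
---
# on the node hosting the blocked OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_historic_ops
---
That should make it clearer whether a request that has only "reached pg" is
waiting on the local filestore/journal, on sub-ops from a peer OSD, or on
peering/map updates for that PG.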
Christian

> > > > > The whole cluster seems waiting for something... but I don't
> > > > > see what.
> > > > >
> > > > Is it just one specific OSD (or a set of them) or is that all over
> > > > the place?
> > >
> > > A set of them. When I increase the weight of all 4 OSDs of a node, I
> > > frequently have blocked IO from 1 OSD of this node.
> > >
> > The plot thickens, as in, the target of most writes (new PGs being
> > moved there) is the culprit.
> >
> Yes.
>
> > > > Does restarting the OSD fix things?
> > >
> > > Yes. For several minutes.
> > >
> > That also ties into a resource starvation of sorts, I'd investigate
> > along those lines.
> >
> Yes, I agree. I will increase the verbosity of the OSD.
>
> > Christian
> >
> > > > Christian
> > > >
> > > > > On Friday 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have a cluster with a lot of blocked operations each time I
> > > > > > try to move data (by reweighting an OSD a little).
> > > > > >
> > > > > > It's a full SSD cluster, with a 10GbE network.
> > > > > >
> > > > > > In the logs, when I have a blocked OSD, on the main OSD I can
> > > > > > see this:
> > > > > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > > > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > > > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow request
> > > > > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > > > osd_op(client.29760717.1:18680817544
> > > > > > rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4
> > > > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> > > > > > reached pg
> > > > > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > > > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > > > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow request
> > > > > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > > > osd_op(client.29760717.1:18680817544
> > > > > > rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4
> > > > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> > > > > > reached pg
> > > > > >
> > > > > > How should I read that? What is this OSD waiting for?
> > > > > >
> > > > > > Thanks for any help,
> > > > > >
> > > > > > Olivier

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com