Hello,

On Fri, 18 Sep 2015 10:35:37 +0200 Olivier Bonvalet wrote:

> On Friday 18 September 2015 at 17:04 +0900, Christian Balzer wrote:
> > Hello,
> >
> > On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:
> >
> > > Hi,
> > >
> > > sorry for the missing information. I was trying to avoid putting in
> > > too much inappropriate info ;)
> > >
> > Nah, everything helps, there are known problems with some versions,
> > kernels, file systems, etc.
> >
> > Speaking of which, what FS are you using on your OSDs?
> >
> XFS.
>
No surprises there, one hopes.

> > > On Friday 18 September 2015 at 12:30 +0900, Christian Balzer wrote:
> > > > Hello,
> > > >
> > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > >
> > > > The items below help, but be as specific as possible, from OS and
> > > > kernel version to Ceph version, "ceph -s", and any other specific
> > > > details (pool type, replica size).
> > > >
> > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x kernel,
> > > and Ceph 0.80.10.
> >
> > All my stuff is on Jessie, but at least Firefly should be stable and I
> > haven't seen anything like your problem with it.
> > And while 3.14 is an LTS kernel, I wonder if something newer might be
> > beneficial, but probably not.
> >
> Well, I can try a 3.18.x kernel. But for that I have to restart all
> nodes, which will trigger some backfilling and probably some blocked IO
> too ;)
>
Yeah, as I said, it might be helpful in some other ways, but probably not
related to your problems.

> > > I don't have any more ceph status output right now. But I have data
> > > to move tonight again, so I'll track that.
> > >
> > I was interested in that to see how many pools and PGs you have.
>
> Well:
>
>     cluster de035250-323d-4cf6-8c4b-cf0faf6296b1
>      health HEALTH_OK
>      monmap e21: 3 mons at {faude=10.0.0.13:6789/0,murmillia=10.0.0.18:6789/0,rurkh=10.0.0.19:6789/0},
>             election epoch 4312, quorum 0,1,2 faude,murmillia,rurkh
>      osdmap e847496: 88 osds: 88 up, 87 in
>       pgmap v86390609: 6632 pgs, 16 pools, 18883 GB data, 5266 kobjects
>             68559 GB used, 59023 GB / 124 TB avail
>                 6632 active+clean
>   client io 3194 kB/s rd, 23542 kB/s wr, 1450 op/s
>
> There are mainly 2 pools in use: an "ssd" pool and an "hdd" pool. The
> hdd pool uses different OSDs, on different nodes.
> Since I don't often rebalance data in the hdd pool, I haven't seen the
> problem on it yet.
>
How many PGs in the SSD pool? I can see this easily exceeding your open
file limits.
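If you want to put actual numbers to that while tonight's data move runs,
something along these lines (run as root on one of the SSD nodes; the "ssd"
pool name is taken from your description above) should show both the PG
count and how close each ceph-osd process is to its descriptor limit:
---
# PGs in the ssd pool (the pool lines of "ceph osd dump" show this as well)
ceph osd pool get ssd pg_num

# open descriptors vs. limit for every ceph-osd process on this node
for pid in $(pidof ceph-osd); do
    echo "ceph-osd pid $pid: $(ls /proc/$pid/fd | wc -l) fds in use"
    grep 'Max open files' /proc/$pid/limits
done
---
Between object files, leveldb and messenger connections to every peer and
client, a busy OSD can chew through a 1024 soft limit rather quickly during
a reshuffle.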
> > > The affected pool is a standard one (no erasure coding), with only
> > > 2 replicas (size=2).
> > >
> > Good, nothing fancy going on there then.
> >
> > > > > Some additional information:
> > > > > - I have 4 SSDs per node.
> > > > Type, if nothing else for anecdotal reasons.
> > >
> > > I have 7 storage nodes here:
> > > - 3 nodes which each have 12 OSDs on 300GB SSDs
> > > - 4 nodes which each have 4 OSDs on 800GB SSDs
> > >
> > > And I'm trying to replace the 12x300GB nodes with the 4x800GB nodes.
> > >
> > Type as in model/maker, but helpful information.
> >
> 300GB models are Intel SSDSC2BB300G4 (DC S3500).
> 800GB models are Intel SSDSC2BB800H4 (DC S3500 I think).
>
0.3 DWPD, but I guess you know that.

> > > > > - the CPU usage is near 0
> > > > > - IO wait is near 0 too
> > > > Including the trouble OSD(s)?
> > >
> > > Yes
> > >
> > > > Measured how, iostat or atop?
> > >
> > > iostat, htop, and confirmed with Zabbix supervisor.
> > >
> > Good.
> > I'm sure you checked for network errors.
> > Single network or split client/cluster network?
> >
> It's the first thing I checked, and latency and packet loss are
> monitored between each node and the mons, but maybe I forgot some
> checks.
>
> > > > > - bandwidth usage is also near 0
> > > > >
> > > > Yeah, all of the above are not surprising if everything is stuck
> > > > waiting on some ops to finish.
> > > >
> > > > How many nodes are we talking about?
> > > >
> > > 7 nodes, 52 OSDs.
> > >
> > That'd be below the threshold for most system tunables (there are
> > various threads and articles on how to tune Ceph for "large" clusters).
> >
> > Since this happens only when your cluster reshuffles data (and thus
> > has more threads going), what is your ulimit setting for open files?
> >
> Wow... the default one on Debian Wheezy: 1024.
>
You want to fix this, both during startup and in general.
I use this for sysv-init systems like Wheezy:
---
# cat /etc/initscript
#
ulimit -Hn 65536
ulimit -Sn 16384

# Execute the program.
eval exec "$4"
---
and:
---
# cat /etc/security/limits.d/tuning.conf
root    soft    nofile  16384
root    hard    nofile  65536
*       soft    nofile  16384
*       hard    nofile  65536
---
Adjust as needed.
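Also, when you catch one of those blocked OSDs in the act, its admin socket
can tell you more than the one-line slow request warnings do. Assuming the
default socket path, something like this (replace <id> with the OSD number)
dumps the ops currently in flight and the slowest recent ones, along with
the stage each one is sitting at:
---
# on the node hosting the blocked OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_historic_ops
---
That should make it clearer whether a request that has only "reached pg" is
waiting on the local filestore/journal, on sub-ops from a peer OSD, or on
peering/map updates for that PG.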
Christian

> > > > > The whole cluster seems waiting for something... but I don't
> > > > > see what.
> > > > >
> > > > Is it just one specific OSD (or a set of them) or is that all over
> > > > the place?
> > >
> > > A set of them. When I increase the weight of all 4 OSDs of a node, I
> > > frequently have blocked IO from 1 OSD of this node.
> > >
> > The plot thickens, as in, the target of most writes (new PGs being
> > moved there) is the culprit.
> >
> Yes.
>
> > > > Does restarting the OSD fix things?
> > >
> > > Yes. For several minutes.
> > >
> > That also ties into a resource starvation of sorts, I'd investigate
> > along those lines.
> >
> Yes, I agree. I will increase the verbosity of the OSD.
>
> > Christian
> >
> > > > Christian
> > > >
> > > > > On Friday 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have a cluster with a lot of blocked operations each time I
> > > > > > try to move data (by reweighting an OSD a little).
> > > > > >
> > > > > > It's a full SSD cluster, with a 10GbE network.
> > > > > >
> > > > > > In the logs, when I have a blocked OSD, on the main OSD I can
> > > > > > see this:
> > > > > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > > > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > > > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow request
> > > > > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > > > osd_op(client.29760717.1:18680817544
> > > > > > rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4
> > > > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> > > > > > reached pg
> > > > > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > > > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > > > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow request
> > > > > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > > > osd_op(client.29760717.1:18680817544
> > > > > > rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4
> > > > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> > > > > > reached pg
> > > > > >
> > > > > > How should I read that? What is this OSD waiting for?
> > > > > >
> > > > > > Thanks for any help,
> > > > > >
> > > > > > Olivier

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com