Hi, sorry for the missing information. I was trying to avoid including too much irrelevant detail ;)

On Friday, 18 September 2015 at 12:30 +0900, Christian Balzer wrote:
> Hello,
>
> On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
>
> The items below help, but be as specific as possible, from OS and kernel
> version to Ceph version, "ceph -s", any other specific details (pool
> type, replica size).

All nodes run Debian Wheezy with a vanilla 3.14.x kernel, and Ceph 0.80.10.
I don't have a "ceph -s" output at hand right now, but I have data to move
again tonight, so I'll capture one then. The affected pool is a standard
replicated one (no erasure coding), with only 2 replicas (size=2).

> > Some additional information:
> > - I have 4 SSDs per node.
> Type, if nothing else for anecdotal reasons.

I have 7 storage nodes here:
- 3 nodes with 12 OSDs each, on 300GB SSDs
- 4 nodes with 4 OSDs each, on 800GB SSDs

And I'm trying to replace the 12x300GB nodes with the 4x800GB nodes.

> > - the CPU usage is near 0
> > - IO wait is near 0 too
> Including the trouble OSD(s)?

Yes.

> Measured how, iostat or atop?

With iostat and htop, confirmed by our Zabbix monitoring.

> > - bandwidth usage is also near 0
>
> Yeah, all of the above are not surprising if everything is stuck waiting
> on some ops to finish.
>
> How many nodes are we talking about?

7 nodes, 52 OSDs.

> > The whole cluster seems to be waiting for something... but I don't see
> > what.
>
> Is it just one specific OSD (or a set of them) or is that all over the
> place?

A set of them. When I increase the weight of all 4 OSDs of a node, I
frequently get blocked I/O from one OSD of that node.

> Does restarting the OSD fix things?

Yes, for several minutes.

> Christian
>
> > On Friday, 18 September 2015 at 02:35 +0200, Olivier Bonvalet wrote:
> > > Hi,
> > >
> > > I have a cluster with a lot of blocked operations each time I try to
> > > move data (by slightly reweighting an OSD).
> > >
> > > It's a full-SSD cluster, with a 10GbE network.
> > >
> > > In the logs, when I have a blocked OSD, I can see this on the primary
> > > OSD:
> > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow request
> > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> > > reached pg
> > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow request
> > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.00000000027f [write 180224~16384] 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> > > reached pg
> > >
> > > How should I read that? What is this OSD waiting for?
> > >
> > > Thanks for any help,
> > >
> > > Olivier
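To see what such an op is actually stuck on, one approach is to query the
admin socket of the OSD reporting the slow requests. A minimal sketch,
assuming the default socket path and osd.12 as a placeholder ID:

    # Which OSDs are currently reporting blocked requests?
    ceph health detail

    # On the node hosting the suspect OSD: in-flight ops and their event
    # history (reached_pg, waiting for subops, commit_sent, ...)
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight

    # Recently completed slow ops, with per-step timestamps
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops

The per-op event list narrows down whether a request is waiting on the PG
itself, on the local journal/disk, or on a replica OSD.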
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
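Since the blocked requests appear while data is being moved, it can also
help to throttle backfill/recovery so client I/O keeps being serviced; a
sketch, assuming Firefly-era (0.80.x) option names:

    # Limit concurrent backfills and active recovery ops per OSD
    # (the Firefly defaults are considerably higher)
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

    # To persist across OSD restarts, set the same values in the [osd]
    # section of ceph.conf:
    #   osd max backfills = 1
    #   osd recovery max active = 1

This only throttles the data movement; it won't fix an OSD that blocks
outright, but it makes a reweight far less disruptive while the root cause
is investigated.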