I fear I've hit a bug as well. I'm considering an upgrade to the latest
release of hammer, and I'm somewhat concerned that I may lose those PGs.
The exact OSD-juggling steps I've been running, and the client-message
throttles I plan to check, are at the bottom of this message, below the
quoted reply.

-H

> On May 25, 2016, at 07:42, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
>> On Tue, May 24, 2016 at 11:19 PM, Heath Albritton <halbritt@xxxxxxxx> wrote:
>> I'm not going to attempt threading; apologies for the two messages on
>> the same topic. Christian is right, though: 3 nodes per tier, with 8
>> SSDs per node in the cache tier and 12 spinning disks per node in the
>> cold tier. 10GbE client network with a separate 10GbE back-side
>> network. Each node in the cold tier has two Intel P3700 SSDs as
>> journals. This setup has yielded excellent performance over the past
>> year.
>>
>> The memory exhaustion comes purely from one errant OSD process. All
>> the remaining processes look fairly normal in terms of memory
>> consumption.
>>
>> These nodes aren't particularly busy. A random sampling shows a few
>> hundred kilobytes of data being written and very few reads.
>>
>> Thus far I've done quite a bit of juggling of OSDs: setting the noup
>> flag on the cluster, restarting the failed OSDs, letting them catch
>> up to the current map, then clearing the noup flag and letting them
>> rejoin. Eventually they fail again and a fairly intense recovery
>> follows.
>>
>> Here's ceph -s:
>>
>> https://dl.dropboxusercontent.com/u/90634073/ceph/ceph_dash_ess.txt
>>
>> The cluster has been in this state for a while. There are 3 PGs that
>> seem to be problematic:
>>
>> [root@t2-node01 ~]# pg dump | grep recovering
>> -bash: pg: command not found
>> [root@t2-node01 ~]# ceph pg dump | grep recovering
>> dumped all in format plain
>> 9.2f1  1353 1075 4578 1353 1075 9114357760 2611 2611 active+recovering+degraded+remapped 2016-05-24 21:49:26.766924 8577'2611 8642:84 [15,31] 15 [15,31,0] 15 5123'2483 2016-05-23 23:52:54.360710 5123'2483 2016-05-23 23:52:54.360710
>> 12.258 878 875 2628 0 0 4414509568 1534 1534 active+recovering+undersized+degraded 2016-05-24 21:47:48.085476 4261'1534 8587:17712 [4,20] 4 [4,20] 4 4261'1534 2016-05-23 07:22:44.819208 4261'1534 2016-05-23 07:22:44.819208
>> 11.58  376 0 1 2223 0 1593129984 4909 4909 active+recovering+degraded+remapped 2016-05-24 05:49:07.531198 8642'409248 8642:406269 [56,49,41] 56 [40,48,62] 40 4261'406995 2016-05-22 21:40:40.205540 4261'406450 2016-05-21 21:37:35.497307
>>
>> pg 9.2f1 query:
>> https://dl.dropboxusercontent.com/u/90634073/ceph/pg_9.21f.txt
>>
>> When I query 12.258 it just hangs.
>>
>> pg 11.58 query:
>> https://dl.dropboxusercontent.com/u/90634073/ceph/pg_11.58.txt
>
> Well, you've clearly had some things go very wrong. That "undersized"
> state means the PG doesn't have enough copies to be allowed to process
> writes; I'm a little confused that it's also marked active, but I
> don't quite remember the PG state diagrams involved. You should
> consider it down; it should be trying to recover itself, though. I'm
> not quite certain whether the query is considered an operation the PG
> isn't allowed to service (which the RADOS team will need to fix, if
> that's not done already in later releases) or whether the query
> hanging is indicative of yet another problem.
>
> The memory expansion is probably operations incoming on some of those
> missing objects, or on the PG which can't take writes (but is trying
> to recover itself to a state where it *can*).
> In general it shouldn't be enough to exhaust the memory in the
> system, but you might have mis-tuned things so that clients are
> allowed to use up a lot more memory than is appropriate, or there
> might be a bug in v0.94.5.
> -Greg
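
For completeness, here is roughly the per-OSD cycle I've been running
when one of them blows up. The OSD id (osd.12) is just a placeholder,
and the restart command depends on how the hammer packages were set up
(sysvinit on these nodes; newer packaging uses the systemd unit). If
the "status" admin-socket command isn't present in this build, the OSD
log shows the same map catch-up:

ceph osd set noup                    # keep restarted OSDs from being marked up while they replay maps
service ceph restart osd.12          # or: systemctl restart ceph-osd@12, depending on init system
ceph daemon osd.12 status            # watch "newest_map" until it reaches the current epoch
ceph osd stat                        # current osdmap epoch and flags, for comparison
ceph osd unset noup                  # let the OSD be marked up and peer again
ceph -w                              # watch the recovery that follows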
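
On the tuning point Greg raised: the only knobs I know of that bound
client message memory on an OSD are osd_client_message_size_cap and
osd_client_message_cap (names from memory, so worth double-checking
against the hammer docs; I believe the defaults are roughly 500 MB and
100 messages). This is what I plan to check on the runaway OSD, again
with osd.12 as a placeholder and the injectargs value only an example,
not a recommendation:

ceph daemon osd.12 config get osd_client_message_size_cap
ceph daemon osd.12 config get osd_client_message_cap
ceph daemon osd.12 dump_ops_in_flight   # is it sitting on a pile of client ops for the sick PGs?
ceph tell osd.* injectargs '--osd_client_message_size_cap 134217728'   # temporary; only affects running daemons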