What do the following show you?

    ceph pg 12.258 list_unfound    # this one may hang too
    ceph pg dump_stuck

Also enable debug logging on osd.4:

    debug osd = 20
    debug filestore = 20
    debug ms = 1

(A way to inject these at runtime is sketched at the bottom of this mail.)

But honestly, my best advice is to upgrade to the latest release. It
would save you a lot of grief.

- Shinobu

On Thu, May 26, 2016 at 5:25 AM, Heath Albritton <halbritt@xxxxxxxx> wrote:
> I fear I've hit a bug as well. I'm considering an upgrade to the
> latest release of hammer, but I'm somewhat concerned that I may lose
> those PGs.
>
>
> -H
>
>> On May 25, 2016, at 07:42, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>>> On Tue, May 24, 2016 at 11:19 PM, Heath Albritton <halbritt@xxxxxxxx> wrote:
>>> Not going to attempt threading, and apologies for the two messages
>>> on the same topic. Christian is right, though: 3 nodes per tier,
>>> 8 SSDs per node in the cache tier, 12 spinning disks in the cold
>>> tier. 10GbE client network with a separate 10GbE back-side network.
>>> Each node in the cold tier has two Intel P3700 SSDs as journals.
>>> This setup has yielded excellent performance over the past year.
>>>
>>> The memory exhaustion comes purely from one errant OSD process. All
>>> the remaining processes look fairly normal in terms of memory
>>> consumption.
>>>
>>> These nodes aren't particularly busy. A random sampling shows a few
>>> hundred kilobytes of data being written and very few reads.
>>>
>>> Thus far, I've done quite a bit of juggling of OSDs: setting the
>>> cluster to noup, restarting the failed ones, letting them get to
>>> the current map, and then clearing the noup flag and letting them
>>> rejoin. Eventually they'll fail again, and then a fairly intense
>>> recovery happens.
>>>
>>> Here's ceph -s:
>>>
>>> https://dl.dropboxusercontent.com/u/90634073/ceph/ceph_dash_ess.txt
>>>
>>> The cluster has been in this state for a while. There are 3 PGs
>>> that seem to be problematic:
>>>
>>> [root@t2-node01 ~]# ceph pg dump | grep recovering
>>> dumped all in format plain
>>> 9.2f1   1353  1075  4578  1353  1075  9114357760  2611  2611
>>>         active+recovering+degraded+remapped
>>>         2016-05-24 21:49:26.766924  8577'2611  8642:84
>>>         [15,31] 15  [15,31,0] 15
>>>         5123'2483  2016-05-23 23:52:54.360710
>>>         5123'2483  2016-05-23 23:52:54.360710
>>> 12.258  878   875   2628  0     0     4414509568  1534  1534
>>>         active+recovering+undersized+degraded
>>>         2016-05-24 21:47:48.085476  4261'1534  8587:17712
>>>         [4,20] 4  [4,20] 4
>>>         4261'1534  2016-05-23 07:22:44.819208
>>>         4261'1534  2016-05-23 07:22:44.819208
>>> 11.58   376   0     1     2223  0     1593129984  4909  4909
>>>         active+recovering+degraded+remapped
>>>         2016-05-24 05:49:07.531198  8642'409248  8642:406269
>>>         [56,49,41] 56  [40,48,62] 40
>>>         4261'406995  2016-05-22 21:40:40.205540
>>>         4261'406450  2016-05-21 21:37:35.497307
>>>
>>> pg 9.2f1 query:
>>> https://dl.dropboxusercontent.com/u/90634073/ceph/pg_9.21f.txt
>>>
>>> When I query 12.258, it just hangs.
>>>
>>> pg 11.58 query:
>>> https://dl.dropboxusercontent.com/u/90634073/ceph/pg_11.58.txt
>>
>> Well, you've clearly had some things go very wrong. That "undersized"
>> means the PG doesn't have enough copies to be allowed to process
>> writes. I'm a little confused that it's also marked active, but I
>> don't quite remember the PG state diagrams involved. You should
>> consider it down; it should be trying to recover itself, though.
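A quick side note on the "undersized" state Greg describes: it's worth
double-checking the pool's replication settings, since min_size is what
gates whether a shrunken PG may serve I/O. A rough sketch -- the part of
the PG id before the dot is the pool id (12 for pg 12.258), and
<pool-name> below is a placeholder to fill in:

    ceph osd dump | grep "^pool 12 "        # shows the pool's name, size and min_size
    ceph osd pool get <pool-name> size      # replica count the pool wants
    ceph osd pool get <pool-name> min_size  # minimum replicas before the PG will serve I/O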
>> I'm not quite certain whether the query is considered an operation
>> the PG isn't allowed to service (which the RADOS team will need to
>> fix, if that isn't done already in later releases), or whether the
>> hanging query is indicative of yet another problem.
>>
>> The memory expansion is probably operations incoming on some of
>> those missing objects, or on the PG which can't take writes (but is
>> trying to recover itself to a state where it *can*). In general that
>> shouldn't be enough to exhaust the memory in the system, but you
>> might have mis-tuned things so that clients are allowed to use up a
>> lot more memory than is appropriate, or there might be a bug in
>> v0.94.5.
>> -Greg

--
Email:
shinobu@xxxxxxxxx
shinobu@xxxxxxxxxx
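PS -- on the client-memory point Greg raises: the knobs I would look at
first are osd_client_message_cap and osd_client_message_size_cap, which
bound how many client messages (and how many bytes of them) a single OSD
will hold in memory at once. A rough sketch, run on the node hosting the
suspect OSD (osd.4 here, adjust to taste; the 100 MB figure is only an
example):

    ceph daemon osd.4 config get osd_client_message_cap        # max in-flight client messages
    ceph daemon osd.4 config get osd_client_message_size_cap   # max bytes of in-flight client data
    # lower the byte cap at runtime if clients really are the source of the bloat
    ceph tell osd.4 injectargs '--osd_client_message_size_cap 104857600'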
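And the debug settings suggested at the top of this mail can be injected
at runtime, without restarting the daemon -- a sketch, again assuming
osd.4 is the suspect OSD:

    ceph tell osd.4 injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'
    # watch /var/log/ceph/ceph-osd.4.log on that node, then dial it back, e.g.:
    ceph tell osd.4 injectargs '--debug_osd 0/5 --debug_filestore 1/5 --debug_ms 0/5'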