Re: Blocked ops, OSD consuming memory, hammer

Hello,

On Thu, 26 May 2016 07:26:19 +0900 Shinobu Kinjo wrote:

> What will the following show you?
> 
> ceph pg 12.258 list_unfound  // maybe hung...
> ceph pg dump_stuck
> 
> and enable debug to osd.4
> 
> debug osd = 20
> debug filestore = 20
> debug ms = 1
> 
> But honestly, my best bet is to upgrade to the latest release. It would
> make your life much easier.
>
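(For reference: the debug levels suggested above can be injected into the
running daemon at runtime instead of being set in ceph.conf and restarting.
This is only a minimal sketch, assuming the affected daemon is osd.4 and
that the levels get turned back down once the logs have been captured; the
revert values are roughly the hammer defaults.)

  # inject the suggested debug levels into the running osd.4
  ceph tell osd.4 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

  # reproduce the problem, grab /var/log/ceph/ceph-osd.4.log, then revert
  ceph tell osd.4 injectargs '--debug-osd 0/5 --debug-filestore 1/3 --debug-ms 0/5'
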
While upgrading to the latest version is often a good idea (and certainly
the knee-jerk reaction from the developers), it's not something that should
be done lightly.

Especially since Heath is using a cache tier: had he run into this bug
when 0.94.6 was the latest version, an upgrade could well have destroyed
much of his data, thanks to the massive cache-tier bug in that release.

I (and probably quite a few others) am also running 0.94.5 and, for the
record, have not seen this problem. I am not going to upgrade to 0.94.7 or
Jewel until I have extensively tested things on my newly ordered
staging/test cluster, which will actually allow me to test these things
properly.

And while Heath's cluster and mine are not large enough to run into the
MON election storm caused by the CRC bug introduced in 0.94.7, that's
another item on the "don't upgrade blindly" list.
 
Regards,

Christian
>  - Shinobu
> 
> On Thu, May 26, 2016 at 5:25 AM, Heath Albritton <halbritt@xxxxxxxx>
> wrote:
> > I fear I've hit a bug as well.  Considering an upgrade to the latest
> > release of hammer.  Somewhat concerned that I may lose those PGs.
> >
> >
> > -H
> >
> >> On May 25, 2016, at 07:42, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >>
> >>> On Tue, May 24, 2016 at 11:19 PM, Heath Albritton
> >>> <halbritt@xxxxxxxx> wrote: Not going to attempt threading and
> >>> apologies for the two messages on the same topic.  Christian is
> >>> right, though.  3 nodes per tier, 8 SSDs per node in the cache tier,
> >>> 12 spinning disks in the cold tier.  10GE client network with a
> >>> separate 10GE back side network.  Each node in the cold tier has two
> >>> Intel P3700 SSDs as a journal.  This setup has yielded excellent
> >>> performance over the past year.
> >>>
> >>> The memory exhaustion comes purely from one errant OSD process.  All
> >>> the remaining processes look fairly normal in terms of memory
> >>> consumption.
> >>>
> >>> These nodes aren't particularly busy.  A random sampling shows a few
> >>> hundred kilobytes of data being written and very few reads.
> >>>
> >>> Thus far, I've done quite a bit of juggling of OSDs.  Setting the
> >>> cluster to noup.  Restarting the failed ones, letting them get to the
> >>> current map and then clearing the noup flag and letting them rejoin.
> >>> Eventually, they'll fail again and then a fairly intense recovery
> >>> happens.
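(The noup juggling described above boils down to something like the sketch
below; osd.12 and the sysvinit-style restart command are only placeholders,
adjust them to the actual OSD id and init system.)

  # keep restarted OSDs from being marked up while they catch up on maps
  ceph osd set noup

  # restart the affected OSD (command depends on distro/init system)
  service ceph restart osd.12

  # watch the daemon's state and osdmap epoch via its admin socket
  ceph daemon osd.12 status

  # once it has caught up to the current map, let it be marked up again
  ceph osd unset noup
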
> >>>
> >>> here's ceph -s:
> >>>
> >>> https://dl.dropboxusercontent.com/u/90634073/ceph/ceph_dash_ess.txt
> >>>
> >>> Cluster has been in this state for a while.  There are 3 PGs that
> >>> seem to be problematic:
> >>>
> >>> [root@t2-node01 ~]# pg dump | grep recovering
> >>> -bash: pg: command not found
> >>> [root@t2-node01 ~]# ceph pg dump | grep recovering
> >>> dumped all in format plain
> >>> 9.2f1 1353 1075 4578 1353 1075 9114357760 2611 2611
> >>> active+recovering+degraded+remapped 2016-05-24 21:49:26.766924
> >>> 8577'2611 8642:84 [15,31] 15 [15,31,0] 15 5123'2483 2016-05-23
> >>> 23:52:54.360710 5123'2483 2016-05-23 23:52:54.360710
> >>> 12.258 878 875 2628 0 0 4414509568 1534 1534
> >>> active+recovering+undersized+degraded 2016-05-24 21:47:48.085476
> >>> 4261'1534 8587:17712 [4,20] 4 [4,20] 4 4261'1534 2016-05-23
> >>> 07:22:44.819208 4261'1534 2016-05-23 07:22:44.819208
> >>> 11.58 376 0 1 2223 0 1593129984 4909 4909
> >>> active+recovering+degraded+remapped 2016-05-24 05:49:07.531198
> >>> 8642'409248 8642:406269 [56,49,41] 56 [40,48,62] 40 4261'406995
> >>> 2016-05-22 21:40:40.205540 4261'406450 2016-05-21 21:37:35.497307
> >>>
> >>> pg 9.2f1 query:
> >>> https://dl.dropboxusercontent.com/u/90634073/ceph/pg_9.21f.txt
> >>>
> >>> When I query 12.258 it just hangs
> >>>
> >>> pg 11.58 query:
> >>> https://dl.dropboxusercontent.com/u/90634073/ceph/pg_11.58.txt
> >>
> >> Well, you've clearly had some things go very wrong. That "undersized"
> >> means that the pg doesn't have enough copies to be allowed to process
> >> writes, and I'm a little confused that it's also marked active but I
> >> don't quite remember the PG state diagrams involved. You should
> >> consider it down; it should be trying to recover itself though. I'm
> >> not quite certain if the query is considered an operation it's not
> >> allowed to service (which the RADOS team will need to fix, if it's not
> >> done already in later releases) or if the query hanging is indicative
> >> of yet another problem.
> >>
> >> The memory expansion is probably operations incoming on some of those
> >> missing objects, or on the PG which can't take writes (but is trying
> >> to recover itself to a state where it *can*). In general it shouldn't
> >> be enough to exhaust the memory in the system, but you might have
> >> mis-tuned things so that clients are allowed to use up a lot more
> >> memory than is appropriate, or there might be a bug in v0.94.5.
> >> -Greg
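(The tunables Greg is alluding to are most likely the OSD client message
throttles; one way to check whether they have been raised from their
defaults on the misbehaving daemon. osd.4 is just an example target, and
the commented values should be the hammer-era defaults.)

  # inspect the throttles that bound memory held by in-flight client messages
  ceph daemon osd.4 config get osd_client_message_size_cap   # default 524288000 (500 MB)
  ceph daemon osd.4 config get osd_client_message_cap        # default 100 messages
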
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


