Re: OSDs in EC pool flapping


 



No, nothing like that.

The cluster is in the process of having more OSDs added and, while that was ongoing, one OSD was removed because its underlying disk was throwing a bunch of read errors.
Shortly after, the first three OSDs in this PG started crashing with error messages about corrupted EC shards. We seemed to be running into http://tracker.ceph.com/issues/18624, so we moved on to 11.2.1, which essentially means they now fail with a different error message. Our problem looks a bit like this one: http://tracker.ceph.com/issues/18162
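
(For clarity, by "removed" I mean the usual out/stop/crush-remove sequence, roughly the sketch below, with 1234 standing in for the real OSD id rather than a transcript from our cluster:

# ceph osd out 1234
# systemctl stop ceph-osd@1234
# ceph osd crush remove osd.1234
# ceph auth del osd.1234
# ceph osd rm 1234
)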

For a bit more context, here are two more events, going backwards in the dump:


    -3> 2017-08-22 17:42:09.443216 7fa2e283d700  0 osd.1290 pg_epoch: 73324 pg[1.138s0( v 73085'430014 (62760'421568,73085'430014] local-les=73323 n=22919 ec=764 les/c/f 73323/72881/0 73321/73322/73322) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=73322 pi=72880-73321/179 rops=1 bft=1513(7) crt=73085'430014 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] failed_push 1:1c959fdd:::datadisk%2frucio%2fmc16_13TeV%2f41%2f30%2fAOD.11927271._003020.pool.root.1.0000000000000000:head from shard 177(4), reps on  unfound? 0
    -2> 2017-08-22 17:42:09.443299 7fa2e283d700  5 -- op tracker -- seq: 490, time: 2017-08-22 17:42:09.443297, event: done, op: MOSDECSubOpReadReply(1.138s0 73324 ECSubReadReply(tid=5, attrs_read=0))

No amount of taking OSDs out or restarting them fixes it. At this point the first three have been marked out by Ceph: they flapped enough that systemd gave up trying to restart them, they stayed down long enough, and mon_osd_down_out_interval expired. Now the PG map looks like this:

# ceph pg map 1.138
osdmap e73599 pg 1.138 (1.138) -> up [111,1325,437,456,177,1094,194,1513,236,302,1326] acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,1326]
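
More detail on the PG's state can be pulled with the standard commands, for example:

# ceph health detail | grep 1.138
# ceph pg 1.138 query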

Looking at #18162, it matches what we're seeing in our production system (which is experiencing a service outage because of this), but the fact that the issue is marked as minor severity and hasn't had any updates in two months is disconcerting.

As for deep scrubbing, it sounds like it could possibly work in a general corruption situation, but not with a PG stuck in down+remapped and its first three OSDs crashing out after five minutes of operation.


Thanks, 

George



From: Paweł Woszuk [pwoszuk@xxxxxxxxxxxxx]

Sent: 22 August 2017 19:19

To: ceph-users@xxxxxxxxxxxxxx; Vasilakakos, George (STFC,RAL,SC)

Subject: Re:  OSDs in EC pool flapping





Have you experienced huge memory consumption by the flapping OSD daemons? The restarts could be triggered by running out of memory (OOM killer).
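
You can usually confirm that on the OSD hosts with something along these lines:

# dmesg -T | grep -i 'killed process'
# journalctl -k | grep -i 'out of memory'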



If yes, this could be connected with an OSD device error (bad blocks?). We've experienced something similar, though on the Jewel release, not Kraken. The solution was to find the PG that causes the error, set it to deep scrub manually, and restart the PG's primary OSD.
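
With the PG and primary OSD from the log you posted, that would be something along the lines of:

# ceph pg deep-scrub 1.138
# systemctl restart ceph-osd@1290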



Hope that helps, or at least leads to some solution.



On 22 August 2017 at 18:39:47 CEST, george.vasilakakos@xxxxxxxxxx wrote:

Hey folks,


I'm staring at a problem that I have found no solution for and which is causing major issues.
We've had a PG go down with the first 3 OSDs all crashing and coming back only to crash again with the following error in their logs:

    -1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 (62760'421568,72946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 72942/72944/72944) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=72946'430011 lcod 72889'430010 mlcod 72889'430010 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
     0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) **
 in thread 7f4af4057700 thread_name:tp_osd_tp

This has been going on over the weekend; before upgrading from 11.2.0 to 11.2.1 we were seeing a different error message.
The pool is running EC 8+3.
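
(That is, a k=8, m=3 erasure code profile, i.e. roughly a setup like the sketch below; the profile/pool names and PG counts here are illustrative, not our exact settings:

# ceph osd erasure-code-profile set ec-8-3 k=8 m=3
# ceph osd pool create ecpool 1024 1024 erasure ec-8-3
)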

The OSDs crash with that error only to be restarted by systemd and fail again in exactly the same way. Eventually systemd gives up, the mon_osd_down_out_interval expires, and the PG just stays down+remapped while others recover and go active+clean.
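
(For anyone following along, the crash/restart cycle is easy enough to watch per OSD with the usual tools, e.g. for osd.1290:

# systemctl status ceph-osd@1290
# journalctl -u ceph-osd@1290 --since "1 hour ago"
)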

Can anybody help with this type of problem?


Best regards,

George Vasilakakos

ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Paweł Woszuk

PCSS, Poznańskie Centrum Superkomputerowo-Sieciowe

ul. Jana Pawła II nr 10, 61-139 Poznań

Polska


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



