Re: help troubleshooting some osd communication problems

On Fri, Apr 29, 2016 at 9:34 AM, Mike Lovell <mike.lovell@xxxxxxxxxxxxx> wrote:
On Fri, Apr 29, 2016 at 5:54 AM, Alexey Sheplyakov <asheplyakov@xxxxxxxxxxxx> wrote:
Hi,

> i also wonder if just taking 148 out of the cluster (probably just marking it out) would help

As far as I understand, this can only harm your data. The acting set of PG 17.73 is [41, 148],
so after stopping/taking out OSD 148, OSD 41 will store the only copy of objects in PG 17.73
(so it won't accept writes any more).
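For reference, the up/acting sets and the pool's min_size can be checked with the standard CLI; the pool name below is only a placeholder:

    ceph pg map 17.73                        # prints the up set and acting set for the PG
    ceph osd pool get <poolname> min_size    # writes block once acting copies drop below this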

> since there are other osds in the up set (140 and 5)

These OSDs are not in the acting set; they are missing at least some of the objects from PG 17.73
and are copying the missing objects from OSDs 41 and 148. Naturally, this slows down or even
blocks writes to PG 17.73.

OK. I didn't know whether it could just use the members of the up set that are not in the acting set for completing writes. When thinking through it in my head it seemed reasonable, but I could also see pitfalls with doing it. That's why I was asking if it was possible.


> the only thing holding things together right now is a while loop doing an 'ceph osd down 41' every minute

As far as I understand, this disturbs the backfilling and further delays writes to that poor PG.

It definitely does seem to have an impact similar to that. The only upside is that it clears the slow IO messages, though I don't know if it actually lets the client IO complete. Recovery doesn't make any progress in between the down commands either; it's not making any progress on its own anyway.
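(For anyone reading along: the loop referred to above was not posted verbatim, but it was presumably something along these lines:)

    # workaround loop -- mark osd.41 down once a minute
    while true; do
        ceph osd down 41
        sleep 60
    done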

I went to check things this morning and noticed that the number of misplaced objects had dropped below what I was expecting, and I was occasionally seeing lines from ceph -w saying a number of objects were recovering. The only PG in a state other than active+clean was the one that 41 and 148 were bickering about, so it looks like they were now passing traffic. It appeared to start just after one of the osd down events from the loop I had running. A little while after the backfill started making progress, it completed, so it's fine now. I would still like to find out the cause since this has happened twice now, but at least it's not an emergency for me at the moment.
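A quick way to confirm everything has settled (a sketch, not the exact commands used here):

    ceph health detail            # lists any PGs that are still not active+clean
    ceph pg dump_stuck unclean    # should list no PGs once the backfill is done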

One other thing that was odd: I saw the misplaced object count go negative during the backfill. This is one of the lines from ceph -w.

2016-04-29 10:38:15.011241 mon.0 [INF] pgmap v27055697: 6144 pgs: 6143 active+clean, 1 active+undersized+degraded+remapped+backfilling; 123 TB data, 372 TB used, 304 TB / 691 TB avail; 130 MB/s rd, 135 MB/s wr, 11210 op/s; 14547/93845634 objects degraded (0.016%); -13959/93845634 objects misplaced (-0.015%); 27358 kB/s, 7 objects/s recovering

It seemed to complete around the point where it got to -14.5k misplaced. I'm guessing this is just a reporting error, but I immediately started a deep-scrub on the PG just to make sure things are consistent.
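(The deep scrub itself is a one-liner; the PG id is from the thread, but the exact invocation here is a reconstruction:)

    ceph pg deep-scrub 17.73    # queue a deep scrub of the suspect PG
    ceph -w                     # then watch for scrub errors or "inconsistent" states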

mike
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
