Re: all pgs of erasure coded pool stuck stale

Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> · Mon, 16 Nov 2015 13:24:58 +0100

On 13/11/15 19:14, Gregory Farnum wrote:
Somebody else will need to do the diagnosis, but it'll help them if
you can get logs with "debug ms = 1", "debug osd = 20" in the log.

Based on the required features update in the crush map, it looks like
maybe you've upgraded some of your OSDs — is that a thing happening
right now? Perhaps you upgraded some of your OSDs, but not the ones
that just rebooted, and when they went down the cluster upgraded its
required feature set?
Hi Greg,
We weren't upgrading the OSDs, but because you mentioned the crushmap, I
checked whether something had changed there.
We build the crushmap ourselves from config, and due to a faulty
refactoring in our config that I wasn't aware of, 'chooseleaf' had been
changed to 'choose' in the rule for the EC pool.
Switching this back made our OSDs go into replay and eventually become
clean again.
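
For anyone who also generates their crushmap from config, a check along these lines might catch this kind of regression before the map is injected. This is only a rough sketch: it assumes the map is available as decompiled text (as produced by `crushtool -d`), and the sample rule, its name `ecpool`, and the helper `rules_emitting_buckets` are hypothetical, not taken from our setup. It flags rules where the last selection step before `step emit` is a plain `choose` on a bucket type other than `osd`, which would emit buckets instead of OSDs.

```python
import re

# Hypothetical decompiled rule text; name, ids and sizes are illustrative only.
GOOD_RULE = """
rule ecpool {
	ruleset 1
	type erasure
	min_size 3
	max_size 20
	step set_chooseleaf_tries 5
	step take default
	step chooseleaf indep 0 type host
	step emit
}
"""

# Matches "step choose|chooseleaf firstn|indep <n> type <bucket-type>".
STEP_RE = re.compile(
    r"^\s*step\s+(choose|chooseleaf)\s+(firstn|indep)\s+(-?\d+)\s+type\s+(\S+)\s*$"
)
# Matches one "rule <name> { ... }" block (rule bodies contain no nested braces).
RULE_RE = re.compile(r"\brule\s+(\S+)\s*\{(.*?)\}", re.DOTALL)


def rules_emitting_buckets(crush_text):
    """Return names of rules that reach `step emit` right after a plain
    `choose` on a non-OSD bucket type."""
    suspicious = []
    for name, body in RULE_RE.findall(crush_text):
        last = None  # (operation, bucket type) of the most recent selection step
        for line in body.splitlines():
            m = STEP_RE.match(line)
            if m:
                last = (m.group(1), m.group(4))
            elif line.strip() == "step emit":
                if last and last[0] == "choose" and last[1] != "osd":
                    if name not in suspicious:
                        suspicious.append(name)
                last = None
    return suspicious


# Simulate the faulty refactoring: 'chooseleaf' turned into 'choose'.
BAD_RULE = GOOD_RULE.replace("step chooseleaf indep", "step choose indep")

print(rules_emitting_buckets(GOOD_RULE))
print(rules_emitting_buckets(BAD_RULE))
```

A rule that uses `choose` on hosts followed by a second `choose` down to type `osd` is not flagged, since that combination still ends at OSDs.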

Thank you very much!

Cheers,
Kenneth

-Greg

On Fri, Nov 13, 2015 at 8:12 AM, Kenneth Waegeman
<kenneth.waegeman@xxxxxxxx> wrote:
Hi all,

What could be the reason that all PGs of a whole erasure coded pool are
stuck stale? All OSDs have been restarted and are up.

The details:
We have a setup with 14 OSD hosts with specific OSDs for an erasure coded
pool and 2 SSDs for a cache pool, and 3 separate monitor/metadata nodes with
SSDs for the metadata pool.

This afternoon I had to reboot some OSD nodes, because they weren't
reachable anymore. After the cluster recovered, some PGs were stuck stale. I
saw with `health detail` that they were all the PGs of 2 specific EC-pool
OSDs. I tried restarting those OSDs, but that didn't solve the problem. I then
restarted all OSDs on those nodes, but after that all PGs on the EC OSDs on
those nodes were stuck stale. I read in the docs that this state is reached
when the PG's OSDs are not communicating with the monitors, so I restarted
the monitors. Since that did not solve it, I tried restarting everything.

When the cluster had recovered again, all other PGs were back to active+clean,
except for the PGs in the EC pool; those are still stale+active+clean or
even stale+active+clean+scrubbing+deep.

When I try to query such a PG (e.g. `ceph pg 2.1b0 query`), it just hangs.
That is not the case for the other pools.
If I interrupt, I get: Error EINTR: problem getting command descriptions
from pg.2.1b0
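
To check many PGs without getting stuck on each one, the query could be wrapped in a timeout. The following is a minimal sketch, assuming Python 3.7+ and the `ceph` CLI on the PATH; the helper names `run_with_timeout` and `pg_query` and the 10 second default are my own choices for illustration, not anything Ceph provides.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stdout), or None if it did not
    finish within timeout_s seconds (the child is killed in that case)."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return done.returncode, done.stdout


def pg_query(pgid, timeout_s=10.0):
    """Run `ceph pg <pgid> query`, giving up after timeout_s seconds."""
    return run_with_timeout(["ceph", "pg", pgid, "query"], timeout_s)


# Example (needs a reachable cluster, so it is left commented out):
# result = pg_query("2.1b0")
# print("query hung" if result is None else result[1])
```

A `None` result then marks a PG whose query hangs, as described above, while PGs in the other pools should return their JSON normally.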

I can't see anything strange in the logs of these PGs (attached).

Does anyone have an idea?

Help very much appreciated!

Thanks!

Kenneth

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
