On Fri, May 26, 2017 at 3:05 AM Stuart Harland <s.harland@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> Could you elaborate on what constitutes deleting the PG in this instance? Is a simple `rm` of the directories with the PG number in current sufficient, or does it need some poking at anything else?
No, you need to look at how to use ceph-objectstore-tool. Just removing the directories will leave associated metadata behind in leveldb/rocksdb.
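For reference, the removal on a filestore OSD usually looks something like the sketch below. The OSD id (12) and pgid (1.23) are placeholder values and the paths assume default deployment locations, so adjust for your setup and check `ceph-objectstore-tool --help` on your version first. The OSD has to be stopped while you run it:

    # stop the OSD that holds the stray PG copy
    systemctl stop ceph-osd@12

    # optionally export a backup copy first, in case it's needed later
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 1.23 --op export --file /root/pg1.23.export

    # remove the PG data along with its associated metadata
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 1.23 --op remove

    systemctl start ceph-osd@12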
> It is conceivable that there is a fault with the disks; they are known to be ‘faulty’ in the general sense that they suffer a cliff-edge performance issue. However, I’m somewhat confused about why this would suddenly start happening in the way it has been detected.
Yeah, not sure. It might just be that the restarts are newly exposing old issues, but I don't see how. I gather from skimming that ticket that it was a disk-state bug earlier on that went undetected until Jewel, which is why I was wondering about the upgrades.
-Greg
> We are past early-life failures; most of these disks don’t appear to have any significant issues in their SMART data to indicate that write failures are occurring, and I hadn’t seen this error once until a couple of weeks ago (we’ve been operating this cluster for over 2 years now). The only versions I’m seeing running (just double-checked) are 10.2.5, 10.2.6, and 10.2.7. There was one node that had Hammer running on it a while back, but it’s been running Jewel for months now, so I doubt it’s related to that.