inconsistent pgs


 



On Mon, Jul 7, 2014 at 4:21 PM, James Harper <james at ejbdigital.com.au> wrote:
>>
>> Okay. Based on your description I think the reason for the tunables
>> crashes is that either the "out" OSDs, or possibly one of the
>> monitors, never got restarted. You should be able to update the
>> tunables now, if you want to. (Or there's also a config option that
>> will disable the warning; check the release notes.)
>
> There was never a monitor on the node with the 'out' OSDs. And even if I forgot to restart those OSDs, they definitely got restarted once things got crashy, although maybe it was too late by then?

Yeah, that's probable.
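
For reference, a rough sketch of how you could re-check and then update
the tunables once every mon and OSD really has been restarted; the
warning-suppression option is, I believe, "mon warn on legacy crush
tunables", but double-check the release notes for your version:

    # show the tunables currently in effect
    ceph osd crush show-tunables

    # once all daemons are running the new code, switch profiles
    ceph osd crush tunables optimal

    # or, to just silence the health warning, in ceph.conf:
    #   [mon]
    #   mon warn on legacy crush tunables = false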

>> As for why the MDSes (plural? if you have multiple, be aware that's
>> less stable than a single MDS) were blocked, you might want to check
>> your CRUSH map and make sure it's segregating replicas across hosts.
>> I'm betting you knocked out the only copies of some of your PGs.
>
> Yeah, I had a question about that. In a setup with 3 (was 4) nodes with two OSDs on each, why is there a very small number of PGs that exist on only one node? That kind of defeats the purpose. I haven't checked whether that's still the case now that the migration has completed, and maybe it was an artefact of the tunables change, but taking one node out completely for a reboot definitely results in 'not found' PGs.

It sounds like you've got a bad CRUSH map if you're seeing that. One
of the things the newer tunables do is make the placement algorithm
handle a wider variety of maps, but if PGs are mapping to only one
OSD, you need to fix the map itself.
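
A quick way to check that, roughly (the PG id below is just a
placeholder, and the file paths are illustrative):

    # decompile the CRUSH map and look at the replicated rule
    ceph osd getcrushmap -o /tmp/crushmap.bin
    crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
    grep chooseleaf /tmp/crushmap.txt
    # for host-level replica separation you want something like:
    #   step chooseleaf firstn 0 type host
    # "type osd" here would let both copies of a PG land on one node

    # spot-check where a particular PG actually maps
    ceph pg map 0.1f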

> And are you saying that when I took the two OSDs on one node 'out', some PGs became inaccessible, even though the OSDs holding those PGs were still running (and there should have been other OSDs with replicas)? My setup is with 2 replicas.

That's what I'm guessing. An OSD that has been marked "out" cannot be
used to serve client IO, so if there were (improperly) no other
replicas elsewhere (as you just said is the case), the MDS would need
to wait for those PGs to migrate before it could do IO against them.
(Strictly speaking it doesn't need to wait for a whole PG, just
"enough" of it, but there are a bunch of throttles that probably
prevented even that minimal amount of data from being moved over while
other PGs were backfilled.)
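
The throttles I mean are the per-OSD backfill/recovery limits. Roughly
(the option names are real, but the values below are only illustrative,
not recommendations):

    # inspect the current limits on one OSD via its admin socket
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show \
        | grep -E 'osd_max_backfills|osd_recovery_max_active'

    # temporarily raise them cluster-wide while things recover
    ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'
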
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

