On Wed, 2 Nov 2016, Wido den Hollander wrote:
> 
> > On 2 November 2016 at 15:06, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > 
> > On Wed, 2 Nov 2016, Wido den Hollander wrote:
> > > 
> > > > On 2 November 2016 at 14:30, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > 
> > > > On Wed, 2 Nov 2016, Wido den Hollander wrote:
> > > > > 
> > > > > > On 26 October 2016 at 11:18, Wido den Hollander <wido@xxxxxxxx> wrote:
> > > > > > 
> > > > > > > On 26 October 2016 at 10:44, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > > > > 
> > > > > > > On Wed, 26 Oct 2016, Dan van der Ster wrote:
> > > > > > > > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > >> On 24 October 2016 at 22:29, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > > > > > > >>
> > > > > > > > >> Hi Wido,
> > > > > > > > >>
> > > > > > > > >> This seems similar to what our dumpling tunables cluster does when a few particular osds go down... Though in our case the remapped pgs are correctly shown as remapped, not clean.
> > > > > > > > >>
> > > > > > > > >> The fix in our case will be to enable the vary_r tunable (which will move some data).
> > > > > > > > >
> > > > > > > > > Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
> > > > > > > > >
> > > > > > > > > The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp entries which still exist.
> > > > > > > > >
> > > > > > > > > I'll report back later, but this rebalance will take a lot of time.
> > > > > > > > 
> > > > > > > > I forgot to mention, a workaround for the vary_r issue is to simply remove the down/out osd from the crush map. We just hit this issue again last night on a failed osd, and after removing it from the crush map the last degraded PG started backfilling.
> > > > > > > 
> > > > > > > Also note that if you do enable vary_r, you can set it to a higher value (like 5) to get the benefit without moving as much existing data. See the CRUSH tunable docs for more details!
> > > > > > 
> > > > > > Yes, thanks. So with the input here we have a few options and are deciding which routes to take.
> > > > > > 
> > > > > > The cluster is rather old (hw as well), so we have to be careful at this time. For the record, our options are:
> > > > > > 
> > > > > > - vary_r to 1: 73% misplaced
> > > > > > - vary_r to 2 ~ 4: Looking into it
> > > > > > - Removing dead OSDs from CRUSH
> > > > > > 
> > > > > > As the cluster is under some stress we have to do this on weekends, which makes it a bit difficult, but nothing we can't overcome.
> > > > > > 
> > > > > > Thanks again for the input and I'll report on what we did later on.
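
For anyone weighing the vary_r options listed above: a minimal sketch of how a higher chooseleaf_vary_r value could be tried offline with crushtool before injecting it. The tunable name and the getcrushmap/crushtool/setcrushmap commands are standard; the file names, rule number and replica count are only placeholders for this example, and the real impact should be judged against your own map.

    # grab and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # edit crushmap.txt and set, for example:
    #   tunable chooseleaf_vary_r 4
    # (higher values move less existing data than vary_r = 1)

    # recompile and compare the mappings the old and new maps would produce
    crushtool -c crushmap.txt -o crushmap.new
    crushtool -i crushmap.bin --test --show-mappings --rule 0 --num-rep 3 > old.txt
    crushtool -i crushmap.new --test --show-mappings --rule 0 --num-rep 3 > new.txt
    diff old.txt new.txt | wc -l   # rough count of mappings that would change

    # only once the movement looks acceptable, inject the new map
    ceph osd setcrushmap -i crushmap.new
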
> > > > > So, what I did:
> > > > > - Remove all dead OSDs from the CRUSHMap and OSDMap
> > > > > - Set vary_r to 2
> > > > > 
> > > > > This resulted in:
> > > > > 
> > > > > osdmap e119647: 169 osds: 166 up, 166 in; 6 remapped pgs
> > > > > 
> > > > > pg_temp 4.39 [160,17,10,8]
> > > > > pg_temp 4.2c9 [164,95,10,7]
> > > > > pg_temp 4.816 [167,147,57,2]
> > > > > pg_temp 4.862 [31,160,138,2]
> > > > > pg_temp 4.a83 [156,83,10,7]
> > > > > pg_temp 4.e8e [164,78,10,8]
> > > > > 
> > > > > In this case, osd.2 and osd.10 no longer exist, neither in the OSDMap nor in the CRUSHMap.
> > > > > 
> > > > > root@mon1:~# ceph osd metadata 2
> > > > > Error ENOENT: osd.2 does not exist
> > > > > root@mon1:~# ceph osd metadata 10
> > > > > Error ENOENT: osd.10 does not exist
> > > > > root@mon1:~# ceph osd find 2
> > > > > Error ENOENT: osd.2 does not exist
> > > > > root@mon1:~# ceph osd find 10
> > > > > Error ENOENT: osd.10 does not exist
> > > > > root@mon1:~#
> > > > > 
> > > > > Looking at PG '4.39' for example, a query tells me:
> > > > > 
> > > > > "up": [
> > > > >     160,
> > > > >     17,
> > > > >     8
> > > > > ],
> > > > > "acting": [
> > > > >     160,
> > > > >     17,
> > > > >     8
> > > > > ],
> > > > > 
> > > > > So I really wonder where the pg_temp with osd.10 comes from.
> > > > 
> > > > Hmm.. are the others also like that? You can manually poke it into adjusting pg_temp with
> > > > 
> > > >   ceph osd pg-temp <pgid> <just the primary osd>
> > > > 
> > > > That'll make peering reevaluate what pg_temp it wants (if any). It might be that it isn't noticing that pg_temp matches acting.. but the mon has special code to remove those entries, so hrm. Is this hammer?
> > > 
> > > So yes, that worked. I did it for 3 PGs:
> > > 
> > > # ceph osd pg-temp 4.39 160
> > > # ceph osd pg-temp 4.2c9 164
> > > # ceph osd pg-temp 4.816 167
> > > 
> > > Now my pg_temp looks like:
> > > 
> > > pg_temp 4.862 [31,160,138,2]
> > > pg_temp 4.a83 [156,83,10,7]
> > > pg_temp 4.e8e [164,78,10,8]
> > > 
> > > There we see osd.2 and osd.10 again. I'm not setting these yet since you might want logs from the MONs or OSDs?
> > > 
> > > This is Hammer 0.94.9
> > 
> > I'm pretty sure this is a race condition that got cleaned up as part of https://github.com/ceph/ceph/pull/9078/commits. The mon only checks the pg_temp entries that are getting set/changed, and since those are already in place it doesn't recheck them. Any poke to the cluster that triggers peering ought to be enough to clear it up. So, no need for logs, thanks!
> 
> Ok, just checking.
> 
> > We could add a special check during, say, upgrade, but generally the PGs will re-peer as the OSDs restart anyway and that will clear it up.
> > 
> > Maybe you can just confirm that marking an osd down (say, ceph osd down 31) is also enough to remove the stray entry?
> 
> I already tried a restart of the OSDs, but that didn't work. I also marked osd 31, 160 and 138 as down for PG 4.862, but that didn't work either:
> 
> pg_temp 4.862 [31,160,138,2]
> 
> But this works:
> 
> root@mon1:~# ceph osd dump|grep pg_temp
> pg_temp 4.862 [31,160,138,2]
> pg_temp 4.a83 [156,83,10,7]
> pg_temp 4.e8e [164,78,10,8]
> root@mon1:~# ceph osd pg-temp 4.862 31
> set 4.862 pg_temp mapping to [31]
> root@mon1:~# ceph osd dump|grep pg_temp
> pg_temp 4.a83 [156,83,10,7]
> pg_temp 4.e8e [164,78,10,8]
> root@mon1:~#
> 
> So neither the restarts nor marking the OSDs down fixed the issue. Only the pg-temp trick.
> 
> Still have two PGs left which I can test with. Hmm.
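
For the record, a small sketch of how the remaining stray entries could be found and poked in one go, using only the commands already shown in this thread (ceph osd dump, ceph osd find, ceph pg query and ceph osd pg-temp). The awk parsing is an assumption about the exact output layout on this release, so treat it as an illustration rather than a tested script:

    #!/bin/bash
    # For every pg_temp entry in the OSDMap, check whether it references an OSD
    # that no longer exists; if so, re-issue pg-temp with just the primary so
    # the PG re-peers and the monitor drops the stale entry.
    ceph osd dump | awk '/^pg_temp/ {print $2, $3}' | while read pgid osds; do
        stale=0
        for osd in $(echo "$osds" | tr -d '[]' | tr ',' ' '); do
            ceph osd find "$osd" >/dev/null 2>&1 || stale=1
        done
        if [ "$stale" -eq 1 ]; then
            # the first entry of the "up" set in the pg query is the primary
            primary=$(ceph pg "$pgid" query | awk '/"up": \[/ {getline; gsub(/[ ,]/, ""); print; exit}')
            echo "re-issuing pg-temp for $pgid (primary: osd.$primary)"
            ceph osd pg-temp "$pgid" "$primary"
        fi
    done
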
Did you leave the OSD down long enough for the PG to peer without it? Can you confirm that doesn't work?

Thanks!
s

> Wido
> 
> > Thanks!
> > sage
> 
> > > > > Setting vary_r to 1 will result in a 76% degraded state for the cluster and I'm trying to avoid that (for now).
> > > > > 
> > > > > I restarted the Primary OSDs for all the affected PGs, but that didn't help either.
> > > > > 
> > > > > Any bright ideas on how to fix this?
> > > > 
> > > > This part seems unrelated to vary_r... you shouldn't have to reduce it further!
> > > 
> > > Indeed, like you said, the pg_temp trick fixed it for 3 PGs already. Holding off with the rest in case you want logs or want to debug it further.
> > > 
> > > Wido
> > > 
> > > > sage
> > > > 
> > > > > Wido
> > > > > 
> > > > > > Wido
> > > > > > 
> > > > > > > sage
> > > > > > > 
> > > > > > > > Cheers, Dan
> > > > > > > > 
> > > > > > > > > Wido
> > > > > > > > > 
> > > > > > > > >> Cheers, Dan
> > > > > > > > >> 
> > > > > > > > >> On 24 Oct 2016 22:19, "Wido den Hollander" <wido@xxxxxxxx> wrote:
> > > > > > > > >> 
> > > > > > > > >> > Hi,
> > > > > > > > >> > 
> > > > > > > > >> > On a cluster running Hammer 0.94.9 (upgraded from Firefly) I have 29 remapped PGs according to the OSDMap, but all PGs are active+clean.
> > > > > > > > >> > 
> > > > > > > > >> > osdmap e111208: 171 osds: 166 up, 166 in; 29 remapped pgs
> > > > > > > > >> > 
> > > > > > > > >> > pgmap v101069070: 6144 pgs, 2 pools, 90122 GB data, 22787 kobjects
> > > > > > > > >> >     264 TB used, 184 TB / 448 TB avail
> > > > > > > > >> >     6144 active+clean
> > > > > > > > >> > 
> > > > > > > > >> > The OSDMap shows:
> > > > > > > > >> > 
> > > > > > > > >> > root@mon1:~# ceph osd dump|grep pg_temp
> > > > > > > > >> > pg_temp 4.39 [160,17,10,8]
> > > > > > > > >> > pg_temp 4.52 [161,16,10,11]
> > > > > > > > >> > pg_temp 4.8b [166,29,10,7]
> > > > > > > > >> > pg_temp 4.b1 [5,162,148,2]
> > > > > > > > >> > pg_temp 4.168 [95,59,6,2]
> > > > > > > > >> > pg_temp 4.1ef [22,162,10,5]
> > > > > > > > >> > pg_temp 4.2c9 [164,95,10,7]
> > > > > > > > >> > pg_temp 4.330 [165,154,10,8]
> > > > > > > > >> > pg_temp 4.353 [2,33,18,54]
> > > > > > > > >> > pg_temp 4.3f8 [88,67,10,7]
> > > > > > > > >> > pg_temp 4.41a [30,59,10,5]
> > > > > > > > >> > pg_temp 4.45f [47,156,21,2]
> > > > > > > > >> > pg_temp 4.486 [138,43,10,7]
> > > > > > > > >> > pg_temp 4.674 [59,18,7,2]
> > > > > > > > >> > pg_temp 4.7b8 [164,68,10,11]
> > > > > > > > >> > pg_temp 4.816 [167,147,57,2]
> > > > > > > > >> > pg_temp 4.829 [82,45,10,11]
> > > > > > > > >> > pg_temp 4.843 [141,34,10,6]
> > > > > > > > >> > pg_temp 4.862 [31,160,138,2]
> > > > > > > > >> > pg_temp 4.868 [78,67,10,5]
> > > > > > > > >> > pg_temp 4.9ca [150,68,10,8]
> > > > > > > > >> > pg_temp 4.a83 [156,83,10,7]
> > > > > > > > >> > pg_temp 4.a98 [161,94,10,7]
> > > > > > > > >> > pg_temp 4.b80 [162,88,10,8]
> > > > > > > > >> > pg_temp 4.d41 [163,52,10,6]
> > > > > > > > >> > pg_temp 4.d54 [149,140,10,7]
> > > > > > > > >> > pg_temp 4.e8e [164,78,10,8]
> > > > > > > > >> > pg_temp 4.f2a [150,68,10,6]
> > > > > > > > >> > pg_temp 4.ff3 [30,157,10,7]
> > > > > > > > >> > root@mon1:~#
> > > > > > > > >> > 
> > > > > > > > >> > So I tried to restart osd.160 and osd.161, but that didn't change the state.
> > > > > > > > >> > 
> > > > > > > > >> > root@mon1:~# ceph pg 4.39 query
> > > > > > > > >> > {
> > > > > > > > >> >     "state": "active+clean",
> > > > > > > > >> >     "snap_trimq": "[]",
> > > > > > > > >> >     "epoch": 111212,
> > > > > > > > >> >     "up": [
> > > > > > > > >> >         160,
> > > > > > > > >> >         17,
> > > > > > > > >> >         8
> > > > > > > > >> >     ],
> > > > > > > > >> >     "acting": [
> > > > > > > > >> >         160,
> > > > > > > > >> >         17,
> > > > > > > > >> >         8
> > > > > > > > >> >     ],
> > > > > > > > >> >     "actingbackfill": [
> > > > > > > > >> >         "8",
> > > > > > > > >> >         "17",
> > > > > > > > >> >         "160"
> > > > > > > > >> >     ],
> > > > > > > > >> > 
> > > > > > > > >> > In all these PGs osd.10 is involved, but that OSD is down and out. I tried marking it as down again, but that didn't work.
> > > > > > > > >> > 
> > > > > > > > >> > I haven't tried removing osd.10 yet from the CRUSHMap since that will trigger a rather large rebalance.
> > > > > > > > >> > 
> > > > > > > > >> > This cluster is still running with the Dumpling tunables though, so that might be the issue. But before I trigger a very large rebalance I wanted to check if there are any insights on this one.
> > > > > > > > >> > 
> > > > > > > > >> > Thanks,
> > > > > > > > >> > 
> > > > > > > > >> > Wido
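
As a follow-up to Sage's question at the top of this mail, one way to confirm whether a plain down-mark is enough once the PG has actually finished re-peering might look like the following. PG 4.a83 and osd.156 are simply taken from the pg_temp listing above as an example, and the state string being grepped for assumes the same query output format shown earlier in the thread:

    # mark the primary of one of the remaining PGs down and let it re-peer
    ceph osd down 156

    # wait until the PG reports active+clean again, i.e. peering has finished
    while ! ceph pg 4.a83 query | grep -q '"state": "active+clean"'; do
        sleep 5
    done

    # now check whether the stray pg_temp entry has been dropped
    ceph osd dump | grep pg_temp
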