> On 26 October 2016 at 11:18, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>
> > On 26 October 2016 at 10:44, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >
> >
> > On Wed, 26 Oct 2016, Dan van der Ster wrote:
> > > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> > > >
> > > >> On 24 October 2016 at 22:29, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > >>
> > > >>
> > > >> Hi Wido,
> > > >>
> > > >> This seems similar to what our dumpling tunables cluster does when a few particular osds go down... Though in our case the remapped pgs are correctly shown as remapped, not clean.
> > > >>
> > > >> The fix in our case will be to enable the vary_r tunable (which will move some data).
> > > >>
> > > >
> > > > Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and then to Hammer recently, and we haven't changed the tunables yet.
> > > >
> > > > The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp entries which still exist.
> > > >
> > > > I'll report back later, but this rebalance will take a lot of time.
> > >
> > > I forgot to mention, a workaround for the vary_r issue is to simply remove the down/out osd from the crush map. We just hit this issue again last night on a failed osd, and after removing it from the crush map the last degraded PG started backfilling.
> >
> > Also note that if you do enable vary_r, you can set it to a higher value (like 5) to get the benefit without moving as much existing data. See the CRUSH tunable docs for more details!
> >
>
> Yes, thanks. With the input here we have a few options and are deciding which route to take.
>
> The cluster is rather old (hw as well), so we have to be careful at this time. For the record, our options are:
>
> - vary_r to 1: 73% misplaced
> - vary_r to 2 ~ 4: Looking into it
> - Removing dead OSDs from CRUSH
>
> As the cluster is under some stress we have to do this during the weekends, which makes it a bit difficult, but nothing we can't overcome.
>
> Thanks again for the input and I'll report on what we did later on.
>

So, what I did:

- Removed all dead OSDs from the CRUSHMap and OSDMap
- Set vary_r to 2

This resulted in:

osdmap e119647: 169 osds: 166 up, 166 in; 6 remapped pgs

pg_temp 4.39 [160,17,10,8]
pg_temp 4.2c9 [164,95,10,7]
pg_temp 4.816 [167,147,57,2]
pg_temp 4.862 [31,160,138,2]
pg_temp 4.a83 [156,83,10,7]
pg_temp 4.e8e [164,78,10,8]

In this case osd.2 and osd.10 no longer exist, neither in the OSDMap nor in the CRUSHMap:

root@mon1:~# ceph osd metadata 2
Error ENOENT: osd.2 does not exist
root@mon1:~# ceph osd metadata 10
Error ENOENT: osd.10 does not exist
root@mon1:~# ceph osd find 2
Error ENOENT: osd.2 does not exist
root@mon1:~# ceph osd find 10
Error ENOENT: osd.10 does not exist
root@mon1:~#

Looking at PG '4.39' for example, a query tells me:

"up": [
    160,
    17,
    8
],
"acting": [
    160,
    17,
    8
],

So I really wonder where the pg_temp entry with osd.10 comes from.

Setting vary_r to 1 would result in a 76% degraded state for the cluster and I'm trying to avoid that (for now).

I restarted the primary OSDs for all the affected PGs, but that didn't help either.

Any bright ideas on how to fix this?
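For reference, the OSD removal and the vary_r change were done roughly along these lines. This is only a sketch, not the exact session: osd.2 stands in for each dead OSD ID, the file names are placeholders, and the rule and replica numbers in the crushtool test are illustrative.

# Remove a dead OSD from the CRUSH map, delete its auth key and drop it
# from the OSDMap (repeat for every dead OSD; osd.2 is an example ID)
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2

# Set chooseleaf_vary_r by editing a decompiled copy of the CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and add/adjust the line: tunable chooseleaf_vary_r 2
crushtool -c crushmap.txt -o crushmap.new

# Optionally compare old and new mappings first to estimate the data movement
crushtool -i crushmap.bin --test --show-mappings --rule 0 --num-rep 3 > before.txt
crushtool -i crushmap.new --test --show-mappings --rule 0 --num-rep 3 > after.txt
diff before.txt after.txt | grep -c '^>'   # rough count of changed mappings

# Inject the new map; this is what triggers the rebalance
ceph osd setcrushmap -i crushmap.new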
Wido

> Wido
>
> > sage
> >
> > > Cheers, Dan
> > >
> > > > Wido
> > > >
> > > >> Cheers, Dan
> > > >>
> > > >> On 24 Oct 2016 22:19, "Wido den Hollander" <wido@xxxxxxxx> wrote:
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > On a cluster running Hammer 0.94.9 (upgraded from Firefly) I have 29 remapped PGs according to the OSDMap, but all PGs are active+clean.
> > > >> >
> > > >> > osdmap e111208: 171 osds: 166 up, 166 in; 29 remapped pgs
> > > >> >
> > > >> > pgmap v101069070: 6144 pgs, 2 pools, 90122 GB data, 22787 kobjects
> > > >> >       264 TB used, 184 TB / 448 TB avail
> > > >> >           6144 active+clean
> > > >> >
> > > >> > The OSDMap shows:
> > > >> >
> > > >> > root@mon1:~# ceph osd dump|grep pg_temp
> > > >> > pg_temp 4.39 [160,17,10,8]
> > > >> > pg_temp 4.52 [161,16,10,11]
> > > >> > pg_temp 4.8b [166,29,10,7]
> > > >> > pg_temp 4.b1 [5,162,148,2]
> > > >> > pg_temp 4.168 [95,59,6,2]
> > > >> > pg_temp 4.1ef [22,162,10,5]
> > > >> > pg_temp 4.2c9 [164,95,10,7]
> > > >> > pg_temp 4.330 [165,154,10,8]
> > > >> > pg_temp 4.353 [2,33,18,54]
> > > >> > pg_temp 4.3f8 [88,67,10,7]
> > > >> > pg_temp 4.41a [30,59,10,5]
> > > >> > pg_temp 4.45f [47,156,21,2]
> > > >> > pg_temp 4.486 [138,43,10,7]
> > > >> > pg_temp 4.674 [59,18,7,2]
> > > >> > pg_temp 4.7b8 [164,68,10,11]
> > > >> > pg_temp 4.816 [167,147,57,2]
> > > >> > pg_temp 4.829 [82,45,10,11]
> > > >> > pg_temp 4.843 [141,34,10,6]
> > > >> > pg_temp 4.862 [31,160,138,2]
> > > >> > pg_temp 4.868 [78,67,10,5]
> > > >> > pg_temp 4.9ca [150,68,10,8]
> > > >> > pg_temp 4.a83 [156,83,10,7]
> > > >> > pg_temp 4.a98 [161,94,10,7]
> > > >> > pg_temp 4.b80 [162,88,10,8]
> > > >> > pg_temp 4.d41 [163,52,10,6]
> > > >> > pg_temp 4.d54 [149,140,10,7]
> > > >> > pg_temp 4.e8e [164,78,10,8]
> > > >> > pg_temp 4.f2a [150,68,10,6]
> > > >> > pg_temp 4.ff3 [30,157,10,7]
> > > >> > root@mon1:~#
> > > >> >
> > > >> > So I tried to restart osd.160 and osd.161, but that didn't change the state.
> > > >> >
> > > >> > root@mon1:~# ceph pg 4.39 query
> > > >> > {
> > > >> >     "state": "active+clean",
> > > >> >     "snap_trimq": "[]",
> > > >> >     "epoch": 111212,
> > > >> >     "up": [
> > > >> >         160,
> > > >> >         17,
> > > >> >         8
> > > >> >     ],
> > > >> >     "acting": [
> > > >> >         160,
> > > >> >         17,
> > > >> >         8
> > > >> >     ],
> > > >> >     "actingbackfill": [
> > > >> >         "8",
> > > >> >         "17",
> > > >> >         "160"
> > > >> >     ],
> > > >> >
> > > >> > In all these PGs osd.10 is involved, but that OSD is down and out. I tried marking it as down again, but that didn't work.
> > > >> >
> > > >> > I haven't tried removing osd.10 yet from the CRUSHMap since that will trigger a rather large rebalance.
> > > >> >
> > > >> > This cluster is still running with the Dumpling tunables though, so that might be the issue. But before I trigger a very large rebalance I wanted to check if there are any insights on this one.
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Wido

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com