Hello,

On Tue, 4 Aug 2015 20:33:58 +1000 Daniel Manzau wrote:

> Hi Christian,
>
> True it's not exactly out of the box. Here is the ceph.conf.
>
Your CRUSH rule file and a description would help, too (are those 4 hosts,
or are the HDDs and SSDs shared on the same hardware, as your pool size
suggests?), etc.
My guess is you're following Sebastien's blog entry on how to mix things
on the same host.

> Could it be the "osd crush update on start = false" stopping the
> remapping of a disk on failure?
>
Doubt it, that would be a pretty significant bug.

OTOH, is your "osd_pool_default_size = 2" matched by an
"osd pool default min size = 1"?
As in, is your cluster (or at least the pool using SSDs) totally stuck at
this point?

Christian
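A quick sketch of how to pull that information straight from the cluster;
the pool name "rbd-ssd" below is only a placeholder, substitute whichever
pool actually sits on the SSD root:

    # effective replication settings for the pool in question
    ceph osd pool get rbd-ssd size
    ceph osd pool get rbd-ssd min_size

    # the CRUSH rules, plus the full decompiled map for the list
    ceph osd crush rule dump
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt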
>
> [global]
> fsid = bfb7e666-f66d-45c0-b4fc-b98182fed666
> mon_initial_members = ceph-store1, ceph-store2, ceph-admin1
> mon_host = 10.66.8.2,10.66.8.3,10.66.8.1
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> osd_pool_default_size = 2
> public_network = 10.66.8.0/23
> cluster network = 10.66.16.0/23
>
> [osd]
> osd crush update on start = false
> osd_max_backfills = 2
> osd_recovery_op_priority = 2
> osd_recovery_max_active = 2
> osd_recovery_max_chunk = 4194304
>
> [client]
> rbd cache = true
> rbd cache writethrough until flush = true
> admin socket = /var/run/ceph/rbd-client-$pid.asok
>
>
> Regards,
> Daniel
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: Tuesday, 4 August 2015 3:47 PM
> To: Daniel Manzau
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: PG's Degraded on disk failure not remapped.
>
>
> Hello,
>
> There are a number of reasons I can think of why this would happen.
> You say "default behaviour", but looking at your map it's obvious that
> you probably don't have a default cluster and crush map.
> Your ceph.conf may help, too.
>
> Regards,
>
> Christian
>
> On Tue, 4 Aug 2015 13:05:54 +1000 Daniel Manzau wrote:
>
> > Hi Cephers,
> >
> > We've been testing drive failures and we're just trying to see if the
> > behaviour of our cluster is normal, or if we've set something up wrong.
> >
> > In summary: the OSD is down and out, but the PGs are showing as
> > degraded and don't seem to want to remap. We'd have assumed that once
> > the OSD was marked out, a remap would have happened and we'd see
> > misplaced rather than degraded PGs.
> >
> >     cluster bfb7e824-f37d-45c0-a4fc-a98182fed985
> >      health HEALTH_WARN
> >             43 pgs degraded
> >             43 pgs stuck degraded
> >             44 pgs stuck unclean
> >             43 pgs stuck undersized
> >             43 pgs undersized
> >             recovery 36899/6822836 objects degraded (0.541%)
> >             recovery 813/6822836 objects misplaced (0.012%)
> >      monmap e3: 3 mons at
> > {ceph-admin1=10.66.8.1:6789/0,ceph-store1=10.66.8.2:6789/0,ceph-store2=10.66.8.3:6789/0}
> >             election epoch 950, quorum 0,1,2
> >             ceph-admin1,ceph-store1,ceph-store2
> >      osdmap e6342: 36 osds: 35 up, 35 in; 1 remapped pgs
> >       pgmap v11805515: 1700 pgs, 3 pools, 13165 GB data, 3331 kobjects
> >             25941 GB used, 30044 GB / 55986 GB avail
> >             36899/6822836 objects degraded (0.541%)
> >             813/6822836 objects misplaced (0.012%)
> >                 1656 active+clean
> >                   43 active+undersized+degraded
> >                    1 active+remapped
> >   client io 491 kB/s rd, 3998 kB/s wr, 480 op/s
> >
> >
> > # id  weight  type name                       up/down reweight
> > -6    43.56   root hdd
> > -2    21.78           host ceph-store1-hdd
> > 0     3.63                    osd.0           up      1
> > 2     3.63                    osd.2           up      1
> > 4     3.63                    osd.4           up      1
> > 6     3.63                    osd.6           up      1
> > 8     3.63                    osd.8           up      1
> > 10    3.63                    osd.10          up      1
> > -3    21.78           host ceph-store2-hdd
> > 1     3.63                    osd.1           up      1
> > 3     3.63                    osd.3           up      1
> > 5     3.63                    osd.5           up      1
> > 7     3.63                    osd.7           up      1
> > 9     3.63                    osd.9           up      1
> > 11    3.63                    osd.11          up      1
> > -1    11.48   root ssd
> > -4    5.74            host ceph-store1-ssd
> > 12    0.43                    osd.12          up      1
> > 13    0.43                    osd.13          up      1
> > 14    0.43                    osd.14          up      1
> > 16    0.43                    osd.16          up      1
> > 18    0.43                    osd.18          down    0
> > 19    0.43                    osd.19          up      1
> > 20    0.43                    osd.20          up      1
> > 21    0.43                    osd.21          up      1
> > 32    0.72                    osd.32          up      1
> > 33    0.72                    osd.33          up      1
> > 17    0.43                    osd.17          up      1
> > 15    0.43                    osd.15          up      1
> > -5    5.74            host ceph-store2-ssd
> > 22    0.43                    osd.22          up      1
> > 23    0.43                    osd.23          up      1
> > 24    0.43                    osd.24          up      1
> > 25    0.43                    osd.25          up      1
> > 26    0.43                    osd.26          up      1
> > 27    0.43                    osd.27          up      1
> > 28    0.43                    osd.28          up      1
> > 29    0.43                    osd.29          up      1
> > 30    0.43                    osd.30          up      1
> > 31    0.43                    osd.31          up      1
> > 34    0.72                    osd.34          up      1
> > 35    0.72                    osd.35          up      1
> >
> > Are we misunderstanding the default behaviour? Any help you can
> > provide will be very much appreciated.
> >
> > Regards,
> > Daniel
> >
> > W: www.3ca.com.au
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
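A rough sketch of the commands typically used to see why undersized PGs
stay degraded instead of remapping; the PG id "3.1f" and rule number "1"
below are only placeholders, to be taken from "ceph health detail" and
"ceph osd crush rule dump" respectively:

    # list the stuck PGs and pick one to inspect
    ceph health detail
    ceph pg dump_stuck unclean

    # show the up/acting sets and recovery state for one degraded PG
    ceph pg 3.1f query

    # check whether the SSD rule can actually place two replicas
    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 1 --num-rep 2 --show-mappings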