The space on the hosts in rack2 does not come close to covering the space on the hosts in rack1. With 3 replicas and the default 'host' failure domain, the 3rd copy of almost every PG has to land on one of the two small hosts in rack2, so once enough data has been written those OSDs fill up and the cluster can no longer find room for the 3rd replica of new data. Bottom line: spread your big disks across all 4 hosts, or add some more disks/OSDs to the hosts in rack2. As a last resort you could change the failure domain from the default 'host' to 'osd' (example commands at the end of this mail), but that is very dangerous for a production cluster.

-K.

On 03/24/2016 04:36 PM, yang sheng wrote:
> Hi all,
>
> I am testing ceph right now using 4 servers with 8 OSDs (all OSDs are up and in). I have 3 pools in my cluster (an image pool, a volume pool and the default rbd pool); both the image and volume pools have replication size = 3. Based on the pg equation, there are 448 pgs in my cluster.
>
> $ ceph osd tree
> ID WEIGHT   TYPE NAME                          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 16.07797 root default
> -5 14.38599     rack rack1
> -2  7.17599         host psusnjhhdlc7iosstb001
>  0  3.53899             osd.0                       up  1.00000          1.00000
>  1  3.63699             osd.1                       up  1.00000          1.00000
> -3  7.20999         host psusnjhhdlc7iosstb002
>  2  3.63699             osd.2                       up  1.00000          1.00000
>  3  3.57300             osd.3                       up  1.00000          1.00000
> -6  1.69199     rack rack2
> -4  0.83600         host psusnjhhdlc7iosstb003
>  5  0.43500             osd.5                       up  1.00000          1.00000
>  4  0.40099             osd.4                       up  1.00000          1.00000
> -7  0.85599         host psusnjhhdlc7iosstb004
>  6  0.40099             osd.6                       up  1.00000                0
>  7  0.45499             osd.7                       up  1.00000                0
>
> $ ceph osd dump
> pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 745 flags hashpspool stripe_width 0
> pool 3 'imagesliberty' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 777 flags hashpspool stripe_width 0
>         removed_snaps [1~1,8~c]
> pool 4 'volumesliberty' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 776 flags hashpspool stripe_width 0
>         removed_snaps [1~1,15~14,2a~1,2c~1,2e~24,57~2,5a~18,74~2,78~1,94~5,b7~2]
>
> Right now the ceph health is HEALTH_WARN. I used "ceph health detail" to dump the information, and there is a stuck pg.
>
> $ ceph -s
>     cluster 2e906379-f211-4329-8faf-a8e7600b8418
>      health HEALTH_WARN
>             1 pgs degraded
>             1 pgs stuck degraded
>             1 pgs stuck inactive
>             1 pgs stuck unclean
>             1 pgs stuck undersized
>             1 pgs undersized
>             recovery 23/55329 objects degraded (0.042%)
>      monmap e14: 2 mons at {psusnjhhdlc7ioscom002=192.168.2.62:6789/0,psusnjhhdlc7ioscon002=192.168.2.12:6789/0}
>             election epoch 106, quorum 0,1 psusnjhhdlc7ioscon002,psusnjhhdlc7ioscom002
>      osdmap e776: 8 osds: 8 up, 8 in
>             flags sortbitwise
>       pgmap v519644: 448 pgs, 3 pools, 51541 MB data, 18443 objects
>             170 GB used, 16294 GB / 16464 GB avail
>             23/55329 objects degraded (0.042%)
>                  447 active+clean
>                    1 undersized+degraded+peered
>
> $ ceph health detail
> HEALTH_WARN 1 pgs degraded; 1 pgs stuck unclean; 1 pgs undersized; recovery 23/55329 objects degraded (0.042%)
> pg 3.d is stuck unclean for 58161.177025, current state active+undersized+degraded, last acting [1,3]
> pg 3.d is active+undersized+degraded, acting [1,3]
> recovery 23/55329 objects degraded (0.042%)
>
> If I am right, pg 3.d has only 2 replicas, the primary on osd.1 and the secondary on osd.3. There is no 3rd replica in the cluster. That's why it gives the unhealthy warning.
> I tried decreasing the replication size to 2 for the image pool and the stuck pg disappeared. After I changed the size back to 3, ceph still didn't create the 3rd replica for pg 3.d.
>
> I also tried shutting down server 0 (which has osd.0 and osd.1), which left pg 3.d with only 1 replica in the cluster. It still didn't create another copy, even though I set size = 3 and min_size = 2. Also, more pgs are now in degraded, undersized or unclean state.
>
> $ ceph pg map 3.d
> osdmap e796 pg 3.d (3.d) -> up [3] acting [3]
>
> $ ceph -s
>     cluster 2e906379-f211-4329-8faf-a8e7600b8418
>      health HEALTH_WARN
>             16 pgs degraded
>             16 pgs stuck degraded
>             2 pgs stuck inactive
>             37 pgs stuck unclean
>             16 pgs stuck undersized
>             16 pgs undersized
>             recovery 1427/55329 objects degraded (2.579%)
>             recovery 780/55329 objects misplaced (1.410%)
>      monmap e14: 2 mons at {psusnjhhdlc7ioscom002=192.168.2.62:6789/0,psusnjhhdlc7ioscon002=192.168.2.12:6789/0}
>             election epoch 106, quorum 0,1 psusnjhhdlc7ioscon002,psusnjhhdlc7ioscom002
>      osdmap e796: 8 osds: 6 up, 6 in; 21 remapped pgs
>             flags sortbitwise
>       pgmap v521445: 448 pgs, 3 pools, 51541 MB data, 18443 objects
>             168 GB used, 8947 GB / 9116 GB avail
>             1427/55329 objects degraded (2.579%)
>             780/55329 objects misplaced (1.410%)
>                  411 active+clean
>                   21 active+remapped
>                   14 active+undersized+degraded
>                    2 undersized+degraded+peered
>
> Can anyone advise how to fix the pg 3.d problem, and why ceph couldn't recover when I shut down one server (2 OSDs)?
>
> Thanks
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
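
P.S. If you want to confirm that the current CRUSH map really cannot come up with 3 OSDs for some PGs, you can test it offline with crushtool before touching anything. This is only a sketch: the file names are arbitrary, and "--rule 0" matches the crush_ruleset 0 that all three of your pools use in your osd dump.

    # grab the cluster's compiled CRUSH map and decompile it to text
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # simulate placements for rule 0 with 3 replicas; every line printed by
    # --show-bad-mappings is an input that CRUSH could not map to 3 OSDs
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings

If that prints bad mappings with your current weights, rebalancing the disks across the hosts as described above is the real fix.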
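
And if you do end up changing the failure domain to 'osd' despite the warning above, the usual way is to edit the decompiled map and inject it back. Assuming ruleset 0 is the stock replicated ruleset, the step to change is the chooseleaf line; roughly:

    # in crushmap.txt, inside the rule with "ruleset 0", change
    #     step chooseleaf firstn 0 type host
    # to
    #     step chooseleaf firstn 0 type osd
    # then recompile and inject the new map
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

The reason this is dangerous is that two, or even all three, replicas of a PG may then land on different OSDs of the same host, so losing that single host can take PGs below min_size or lose data outright.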