Re: Cluster unusable

Hi François,

Could you paste somewhere the output of "ceph report" so we can check the pg dump? (It is probably going to be a little too big for the mailing list.) You can bring osd.0 and osd.4 back into the host they belong to (instead of sitting at the root of the crush map) with "ceph osd crush set":

http://ceph.com/docs/master/rados/operations/crush-map/#add-move-an-osd
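
For example, something along these lines should put them back under node12 with the same weight as your other OSDs (the 0.27 weight and the dc_TWO/node12 placement are only read from the tree you posted, adjust if your layout differs):

  ceph osd crush set osd.0 0.27 root=default datacenter=dc_TWO host=node12
  ceph osd crush set osd.4 0.27 root=default datacenter=dc_TWO host=node12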

They won't be used by ruleset 0 because they are not under the "default" bucket. To make sure OSDs are placed under the right host automatically when they start, you may consider setting osd_crush_update_on_start=true:

http://ceph.com/docs/master/rados/operations/crush-map/#ceph-crush-location-hook
http://workbench.dachary.org/ceph/ceph/blob/firefly/src/upstart/ceph-osd.conf#L18
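
For example, a minimal sketch of what this could look like in ceph.conf on node12 (the location value is an assumption based on your hierarchy; it has to differ per node, so it is usually set in each node's local ceph.conf or computed by the crush location hook from the second link instead of being hard-coded):

  [osd]
  osd crush update on start = true
  osd crush location = "datacenter=dc_TWO host=node12"

With that in place the OSD declares its own position in the crush map when it starts, so a reinstalled node ends up under the right host and datacenter without manual crush surgery.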

Cheers

On 23/12/2014 09:56, Francois Petit wrote:
> Hi,
> 
> We use Ceph 0.80.7 for our IceHouse PoC.
> 3 MONs, 3 OSD nodes (ids 10, 11, 12) with 2 OSDs each, 1.5 TB of storage in total.
> 4 pools for RBD, size=2, 512 PGs per pool
> 
> Everything was fine until mid of last week, and here's what happened:
> - OSD node #12 passed away
> - AFAICR, ceph recovered fine
> - I installed a fresh new node #12 (which inadvertently erased its 2 attached OSDs), and used ceph-deploy to make the node and its 2 OSDs join the cluster
> - it was looking okay, except that the weight for the 2 OSDs (osd.0 and osd.4) was a solid "-3.052e-05".
> - I applied the workaround from http://tracker.ceph.com/issues/9998 : 'ceph osd crush reweight' on both OSDs
> - ceph was then busy redistributing PGs on the 6 OSDs. This was on Friday evening
> - on Monday morning (yesterday), ceph was still busy. Actually the two new OSDs were flapping (msg "map eXXXXX wrongly marked me down" every minute)
> - I found the root cause was the firewall on node #12. I opened tcp ports 6789-6900 and this solved the flapping issue
> - ceph kept on reorganising PGs and reached this unhealthy state:
> --- 900 PGs stuck unclean
> --- some 'requests are blocked > 32 sec'
> --- command 'rbd info images/<image_id>' hung
> --- all tested VMs hung
> - So I tried this: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html, and removed the 2 new OSDs
> - ceph again started rebalancing data, and things were looking better (VMs responding, although pretty slowly)
> - but at the end, which is the current state, the cluster was back to an unhealthy state, and our PoC is stuck.
> 
> 
> Fortunately, the PoC users are out for Christmas. I'm here until Wed 4pm UTC+1 and then back on Jan 5, which leaves around 30 hours to solve this "PoC sev1" issue. I hope the community can help me find a solution before Christmas.
> 
> 
> 
> Here are the details (actual host and DC names not shown in these outputs).
> 
> [root@MON ~]# date;for im in $(rbd ls images);do echo $im;time rbd info images/$im;done
> Tue Dec 23 06:53:15 GMT 2014
> 0dde9837-3e45-414d-a2c5-902adee0cfe9
> 
> <no reply for 2 hours, still ongoing...>
> 
> [root@MON ]# rbd ls images | head -5
> 0dde9837-3e45-414d-a2c5-902adee0cfe9
> 2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
> 3917346f-12b4-46b8-a5a1-04296ea0a826
> 4bde285b-28db-4bef-99d5-47ce07e2463d
> 7da30b4c-4547-4b4c-a96e-6a3528e03214
> [root@MON ]#
> 
> [cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
> -rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
> [cloud-user@francois-vm2 ~]$ rm /tmp/file
> 
> <no reply for 1 hour, still ongoing. The RBD image used by that VM is 'volume-2e989ca0-b620-42ca-a16f-e218aea32000'>
> 
> 
> [root@MON ~]# ceph -s
>     cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
>      health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s) set
>      monmap e6: 3 mons at {<MON01>=10.60.9.11:6789/0,<MON06>=10.60.9.16:6789/0,<MON09>=10.60.9.19:6789/0}, election epoch 1338, quorum 0,1,2 <MON01>,<MON06>,<MON09>
>      osdmap e42050: 6 osds: 6 up, 6 in
>             flags noscrub,nodeep-scrub
>       pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
>             600 GB used, 1031 GB / 1632 GB avail
>                    2 inactive
>                 2045 active+clean
>                    1 remapped+peering
>   client io 818 B/s wr, 0 op/s
> 
> [root@MON ~]# ceph health detail
> HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; 2 osds have slow requests; noscrub,nodeep-scrub flag(s) set
> pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last acting [2,1]
> pg 5.ae is stuck inactive for 54774.738938, current state inactive, last acting [2,1]
> pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering, last acting [1,0]
> pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last acting [2,1]
> pg 5.ae is stuck unclean for 286227.592617, current state inactive, last acting [2,1]
> pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering, last acting [1,0]
> pg 5.b3 is remapped+peering, acting [1,0]
> 87 ops are blocked > 67108.9 sec
> 16 ops are blocked > 33554.4 sec
> 84 ops are blocked > 67108.9 sec on osd.1
> 16 ops are blocked > 33554.4 sec on osd.1
> 3 ops are blocked > 67108.9 sec on osd.2
> 2 osds have slow requests
> noscrub,nodeep-scrub flag(s) set
> 
> 
> [root@MON]# ceph osd tree
> # id   weight  type name       up/down reweight
> -1     1.08    root default
> -5     0.54        datacenter dc_TWO
> -2     0.54            host node10
> 1      0.27                osd.1   up      1
> 5      0.27                osd.5   up      1
> -4     0               host node12
> -6     0.54        datacenter dc_ONE
> -3     0.54            host node11
> 2      0.27                osd.2   up      1
> 3      0.27                osd.3   up      1
> 0      0       osd.0   up      1
> 4      0       osd.4   up      1
> 
> (I'm concerned about the above two "ghost" osd.0 and osd.4...)
> 
> 
> 
> [root@MON]# ceph osd dump
> epoch 42050
> fsid f0e3957f-1df5-4e55-baeb-0b2236ff6e03
> created 2014-09-02 13:29:11.352712
> modified 2014-12-22 16:43:22.295253
> flags noscrub,nodeep-scrub
> pool 3 'images' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 5018 flags hashpspool stripe_width 0
> removed_snaps [1~7,a~1,c~5]
> pool 4 'volumes' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 5015 flags hashpspool stripe_width 0
> removed_snaps [1~5,7~c,14~8,1e~2]
> pool 5 'ephemeral' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1553 flags hashpspool stripe_width 0
> pool 6 'backups' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 2499 flags hashpspool stripe_width 0
> removed_snaps [1~5]
> max_osd 8
> osd.0 up   in  weight 1 up_from 40904 up_thru 41379 down_at 40899 last_clean_interval [5563,40902) 10.60.9.22:6800/4527 10.60.9.22:6801/4137004527 10.60.9.22:6811/4137004527 10.60.9.22:6812/4137004527 exists,up 1dea8553-d3fc-4a45-9706-3136104b935e
> osd.1 up   in  weight 1 up_from 4128 up_thru 42049 down_at 4024 last_clean_interval [3247,4006) 10.60.9.20:6800/2062 10.60.9.20:6801/2062 10.60.9.20:6802/2062 10.60.9.20:6803/2062 exists,up f47dea5a-6742-4749-956e-818ff7cb91b4
> osd.2 up   in  weight 1 up_from 40750 up_thru 42048 down_at 40743 last_clean_interval [2950,40742) 10.60.9.21:6808/1141 10.60.9.21:6809/1141 10.60.9.21:6810/1141 10.60.9.21:6811/1141 exists,up 87c71251-df5b-48c9-8737-e1c609722a3f
> osd.3 up   in  weight 1 up_from 40750 up_thru 42039 down_at 40745 last_clean_interval [3998,40745) 10.60.9.21:6801/967 10.60.9.21:6804/967 10.60.9.21:6805/967 10.60.9.21:6806/967 exists,up 6ae95d34-81ae-4e3d-9af2-17886414295f
> osd.4 up   in  weight 1 up_from 40905 up_thru 41426 down_at 40902 last_clean_interval [5575,40903) 10.60.9.22:6805/5375 10.60.9.22:6802/4153005375 10.60.9.22:6803/4153005375 10.60.9.22:6810/4153005375 exists,up dca9f2b2-66cd-406a-9d8a-50ff91b8e4d2
> osd.5 up   in  weight 1 up_from 40350 up_thru 42047 down_at 40198 last_clean_interval [3317,40283) 10.60.9.20:6805/19439 10.60.9.20:6810/1019439 10.60.9.20:6811/1019439 10.60.9.20:6812/1019439 exists,up 0ea4ce0a-f74c-4a2a-9fa5-c7b55373bc86
> pg_temp 5.b3 [1,0]
> 
> 
> Again, I'm concerned about osd.0 and osd.4, which appear as up.
> However these commands succeeded yesterday:
> [root@MON ~]# date;time ceph osd down 0
> Mon Dec 22 15:59:31 UTC 2014
> marked down osd.0.
> 
> real 0m1.264s
> user 0m0.192s
> sys 0m0.031s
> [root@MON ~]# date;time ceph osd down 4
> Mon Dec 22 15:59:35 UTC 2014
> marked down osd.4.
> 
> real 0m0.351s
> user 0m0.193s
> sys 0m0.028s
> 
> 
> The PG map keeps changing, but the state (ceph -s) is still the same. Here is an excerpt of the log.
> [root@MON]# tail -5 /var/log/ceph/ceph.log
> 2014-12-23 08:24:48.585052 mon.0 10.60.9.11:6789/0 1209178 : [INF] pgmap v3291074: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 819 B/s wr, 0 op/s
> 2014-12-23 08:24:52.201230 mon.0 10.60.9.11:6789/0 1209179 : [INF] pgmap v3291075: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 819 B/s wr, 0 op/s
> 2014-12-23 08:24:55.895255 mon.0 10.60.9.11:6789/0 1209180 : [INF] pgmap v3291076: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 560 B/s wr, 0 op/s
> 2014-12-23 08:24:58.583940 mon.0 10.60.9.11:6789/0 1209181 : [INF] pgmap v3291077: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB  used, 1031 GB / 1632 GB avail; 641 B/s wr, 0 op/s
> 2014-12-23 08:25:02.206420 mon.0 10.60.9.11:6789/0 1209182 : [INF] pgmap v3291078: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 1297 B/s wr, 0 op/s
> 
> Apart from the pgmap updates, here are the most recent other messages:
> [root@MON]# grep -v "2 inactive, 2045 active+clean, 1 remapped+peering" /var/log/ceph/ceph.log |tail -5
> 2014-12-23 06:50:37.237534 osd.1 10.60.9.20:6800/2062 16347 : [WRN] slow request 30720.090953 seconds old, received at 2014-12-22 22:18:37.146491: osd_op(client.5021916.0:64428 rbd_data.1bcae02ae8944a.0000000000000510 [sparse-read 3321344~4096] 5.7e03aeb3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
> 2014-12-23 06:50:37.237541 osd.1 10.60.9.20:6800/2062 16348 : [WRN] slow request 30720.093197 seconds old, received at 2014-12-22 22:18:37.144247: osd_op(client.3324797.0:679739 rbd_data.3fdb9c2ae8944a.000000000000030e [sparse-read 3554816~32768] 5.f2d4a8b3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
> 2014-12-23 07:00:38.469599 osd.1 10.60.9.20:6800/2062 16349 : [WRN] 100 slow requests, 2 included below; oldest blocked for > 54130.968782 secs
> 2014-12-23 07:00:38.471314 osd.1 10.60.9.20:6800/2062 16350 : [WRN] slow request 30720.750831 seconds old, received at 2014-12-22 22:28:37.718682: osd_op(client.5021916.0:64967 rbd_data.1bcae02ae8944a.0000000000000510 [sparse-read 3329536~4096] 5.7e03aeb3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
> 2014-12-23 07:00:38.471326 osd.1 10.60.9.20:6800/2062 16351 : [WRN] slow request 30720.750807 seconds old, received at 2014-12-22 22:28:37.718706: osd_op(client.3324797.0:679750 rbd_data.3fdb9c2ae8944a.000000000000030e [sparse-read 2768384~32768] 5.f2d4a8b3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
> [root@MON]#
> 
> 
> The RBD image for the sampled VM, with its hung IOs, looks accessible (but the same command against its parent hangs):
> 
> [root@MON]# date;time rbd info volumes/volume-2e989ca0-b620-42ca-a16f-e218aea32000
> Tue Dec 23 08:27:13 GMT 2014
> rbd image 'volume-2e989ca0-b620-42ca-a16f-e218aea32000':
> size 6144 MB in 768 objects
> order 23 (8192 kB objects)
> block_name_prefix: rbd_data.412bb450fdfb09
> format: 2
> features: layering
> parent: images/80a2e4e0-0a26-4c00-8783-5530dc914719@snap
> overlap: 6144 MB
> 
> real 0m0.098s
> user 0m0.018s
> sys 0m0.009s
> 
> 
> 
> 
> The CRUSH MAP:
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> 
> # devices
> device 0 device0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 device4
> device 5 osd.5
> 
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
> 
> # buckets
> host node10 {
>         id -2 # do not change unnecessarily
>         # weight 0.540
>         alg straw
>         hash 0 # rjenkins1
>         item osd.1 weight 0.270
>         item osd.5 weight 0.270
> }
> host node12 {
>         id -4 # do not change unnecessarily
>         # weight 0.000
>         alg straw
>         hash 0 # rjenkins1
> }
> datacenter dc_TWO {
>         id -5 # do not change unnecessarily
>         # weight 0.540
>         alg straw
>         hash 0 # rjenkins1
>         item node10 weight 0.540
>         item node12 weight 0.000
> }
> host node11 {
>         id -3 # do not change unnecessarily
>         # weight 0.540
>         alg straw
>         hash 0 # rjenkins1
>         item osd.2 weight 0.270
>         item osd.3 weight 0.270
> }
> datacenter dc_ONE {
>         id -6 # do not change unnecessarily
>         # weight 0.540
>         alg straw
>         hash 0 # rjenkins1
>         item node11 weight 0.540
> }
> root default {
>         id -1 # do not change unnecessarily
>         # weight 1.080
>         alg straw
>         hash 0 # rjenkins1
>         item dc_TWO weight 0.540
>         item dc_ONE weight 0.540
> }
> 
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule DRP {
>         ruleset 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type datacenter
>         step emit
> }
> 
> # end crush map
> 
> 
> 
> Francois.
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
