Cluster unusable

Hi,

We use Ceph 0.80.7 for our Icehouse PoC.
3 MONs, 3 OSD nodes (ids 10, 11, 12) with 2 OSDs each, 1.5 TB of storage in total.
4 RBD pools, size=2, 512 PGs per pool.
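(That's 4 × 512 = 2048 PGs in total; with size=2 over 6 OSDs, that is roughly 2 × 2048 / 6 ≈ 683 PG copies per OSD.)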

Everything was fine until the middle of last week. Here's what happened:
- OSD node #12 passed away
- AFAICR, ceph recovered fine
- I installed a fresh new node #12 (which inadvertently erased its 2 attached OSDs) and used ceph-deploy to make the node and its 2 OSDs join the cluster
- it looked okay, except that the weight of the 2 OSDs (osd.0 and osd.4) was stuck at a bogus "-3.052e-05"
- I applied the workaround from http://tracker.ceph.com/issues/9998: 'ceph osd crush reweight' on both OSDs (commands sketched after this list)
- ceph was then busy redistributing PGs across the 6 OSDs; this was on Friday evening
- on Monday morning (yesterday), ceph was still busy. Actually the two new OSDs were flapping (msg "map eXXXXX wrongly marked me down" every minute)
- I found the root cause was the firewall on node #12; opening TCP ports 6789-6900 solved the flapping
- ceph kept on reorganising PGs and reached this unhealthy state:
--- 900 PGs stuck unclean
--- some 'requests are blocked > 32 sec'
--- the command 'rbd info images/<image_id>' hung
--- all tested VMs hung
- so I followed http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/032929.html and removed the 2 new OSDs
- ceph again started rebalancing data, and things were looking better (VMs responding, although pretty slowly)
- but in the end, which is the current state, the cluster went back to an unhealthy state, and our PoC is stuck.
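
For reference, the reweight workaround was roughly the following (from memory; 0.27 is the weight the other OSDs have in the CRUSH map):

ceph osd crush reweight osd.0 0.27
ceph osd crush reweight osd.4 0.27

and the firewall fix on node #12 was something like the rule below (I'm sketching it with plain iptables; the exact rule I used may have differed, and it was made persistent afterwards):

iptables -I INPUT -p tcp --dport 6789:6900 -j ACCEPT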


Fortunately, the PoC users are out for Christmas. I'm here until Wednesday 4pm UTC+1 and then back on Jan 5, so there are around 30 hours left to solve this "PoC sev1" issue. I hope the community can help me find a solution before Christmas.



Here are the details (actual host and DC names not shown in these outputs).

[root@MON ~]# date;for im in $(rbd ls images);do echo $im;time rbd info images/$im;done
Tue Dec 23 06:53:15 GMT 2014
0dde9837-3e45-414d-a2c5-902adee0cfe9

<no reply for 2 hours, still ongoing...>

[root@MON ]# rbd ls images | head -5
0dde9837-3e45-414d-a2c5-902adee0cfe9
2b62a79c-bdbc-43dc-ad88-dfbfaa9d005e
3917346f-12b4-46b8-a5a1-04296ea0a826
4bde285b-28db-4bef-99d5-47ce07e2463d
7da30b4c-4547-4b4c-a96e-6a3528e03214
[root@MON ]#

[cloud-user@francois-vm2 ~]$ ls -lh /tmp/file
-rw-rw-r--. 1 cloud-user cloud-user 552M Dec 22 22:19 /tmp/file
[cloud-user@francois-vm2 ~]$ rm /tmp/file

<no reply for 1 hour, still ongoing. The RBD image used by that VM is 'volume-2e989ca0-b620-42ca-a16f-e218aea32000'>


[root@MON ~]# ceph -s
    cluster f0e3957f-1df5-4e55-baeb-0b2236ff6e03
     health HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; noscrub,nodeep-scrub flag(s) set
     monmap e6: 3 mons at {<MON01>=10.60.9.11:6789/0,<MON06>=10.60.9.16:6789/0,<MON09>=10.60.9.19:6789/0}, election epoch 1338, quorum 0,1,2 <MON01>,<MON06>,<MON09>
     osdmap e42050: 6 osds: 6 up, 6 in
            flags noscrub,nodeep-scrub
      pgmap v3290710: 2048 pgs, 4 pools, 301 GB data, 58987 objects
            600 GB used, 1031 GB / 1632 GB avail
                   2 inactive
                2045 active+clean
                   1 remapped+peering
  client io 818 B/s wr, 0 op/s

[root@MON ~]# ceph health detail
HEALTH_WARN 1 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; 103 requests are blocked > 32 sec; 2 osds have slow requests; noscrub,nodeep-scrub flag(s) set
pg 5.a7 is stuck inactive for 54776.026394, current state inactive, last acting [2,1]
pg 5.ae is stuck inactive for 54774.738938, current state inactive, last acting [2,1]
pg 5.b3 is stuck inactive for 71579.365205, current state remapped+peering, last acting [1,0]
pg 5.a7 is stuck unclean for 299118.648789, current state inactive, last acting [2,1]
pg 5.ae is stuck unclean for 286227.592617, current state inactive, last acting [2,1]
pg 5.b3 is stuck unclean for 71579.365263, current state remapped+peering, last acting [1,0]
pg 5.b3 is remapped+peering, acting [1,0]
87 ops are blocked > 67108.9 sec
16 ops are blocked > 33554.4 sec
84 ops are blocked > 67108.9 sec on osd.1
16 ops are blocked > 33554.4 sec on osd.1
3 ops are blocked > 67108.9 sec on osd.2
2 osds have slow requests
noscrub,nodeep-scrub flag(s) set
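
In case it helps with diagnosing the stuck PGs, I can also run and post the output of, e.g.:

ceph pg 5.b3 query
ceph pg dump_stuck unclean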


[root@MON]# ceph osd tree
# id    weight  type name                       up/down reweight
-1      1.08    root default
-5      0.54            datacenter dc_TWO
-2      0.54                    host node10
1       0.27                            osd.1   up      1
5       0.27                            osd.5   up      1
-4      0                       host node12
-6      0.54            datacenter dc_ONE
-3      0.54                    host node11
2       0.27                            osd.2   up      1
3       0.27                            osd.3   up      1
0       0       osd.0   up      1
4       0       osd.4   up      1

(I'm concerned about the two "ghost" entries for osd.0 and osd.4 above...)



[root@MON]# ceph osd dump
epoch 42050
fsid f0e3957f-1df5-4e55-baeb-0b2236ff6e03
created 2014-09-02 13:29:11.352712
modified 2014-12-22 16:43:22.295253
flags noscrub,nodeep-scrub
pool 3 'images' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 5018 flags hashpspool stripe_width 0
removed_snaps [1~7,a~1,c~5]
pool 4 'volumes' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 5015 flags hashpspool stripe_width 0
removed_snaps [1~5,7~c,14~8,1e~2]
pool 5 'ephemeral' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1553 flags hashpspool stripe_width 0
pool 6 'backups' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 2499 flags hashpspool stripe_width 0
removed_snaps [1~5]
max_osd 8
osd.0 up   in  weight 1 up_from 40904 up_thru 41379 down_at 40899 last_clean_interval [5563,40902) 10.60.9.22:6800/4527 10.60.9.22:6801/4137004527 10.60.9.22:6811/4137004527 10.60.9.22:6812/4137004527 exists,up 1dea8553-d3fc-4a45-9706-3136104b935e
osd.1 up   in  weight 1 up_from 4128 up_thru 42049 down_at 4024 last_clean_interval [3247,4006) 10.60.9.20:6800/2062 10.60.9.20:6801/2062 10.60.9.20:6802/2062 10.60.9.20:6803/2062 exists,up f47dea5a-6742-4749-956e-818ff7cb91b4
osd.2 up   in  weight 1 up_from 40750 up_thru 42048 down_at 40743 last_clean_interval [2950,40742) 10.60.9.21:6808/1141 10.60.9.21:6809/1141 10.60.9.21:6810/1141 10.60.9.21:6811/1141 exists,up 87c71251-df5b-48c9-8737-e1c609722a3f
osd.3 up   in  weight 1 up_from 40750 up_thru 42039 down_at 40745 last_clean_interval [3998,40745) 10.60.9.21:6801/967 10.60.9.21:6804/967 10.60.9.21:6805/967 10.60.9.21:6806/967 exists,up 6ae95d34-81ae-4e3d-9af2-17886414295f
osd.4 up   in  weight 1 up_from 40905 up_thru 41426 down_at 40902 last_clean_interval [5575,40903) 10.60.9.22:6805/5375 10.60.9.22:6802/4153005375 10.60.9.22:6803/4153005375 10.60.9.22:6810/4153005375 exists,up dca9f2b2-66cd-406a-9d8a-50ff91b8e4d2
osd.5 up   in  weight 1 up_from 40350 up_thru 42047 down_at 40198 last_clean_interval [3317,40283) 10.60.9.20:6805/19439 10.60.9.20:6810/1019439 10.60.9.20:6811/1019439 10.60.9.20:6812/1019439 exists,up 0ea4ce0a-f74c-4a2a-9fa5-c7b55373bc86
pg_temp 5.b3 [1,0]


Again, I'm concerned about osd.0 and osd.4, which still appear as up. However, these commands succeeded yesterday:
[root@MON ~]# date;time ceph osd down 0
Mon Dec 22 15:59:31 UTC 2014
marked down osd.0.

real 0m1.264s
user 0m0.192s
sys 0m0.031s
[root@MON ~]# date;time ceph osd down 4
Mon Dec 22 15:59:35 UTC 2014
marked down osd.4.

real 0m0.351s
user 0m0.193s
sys 0m0.028s


The PG map keeps changing, but the state (ceph -s) is still the same. Here is an excerpt of the log.
[root@MON]# tail -5 /var/log/ceph/ceph.log
2014-12-23 08:24:48.585052 mon.0 10.60.9.11:6789/0 1209178 : [INF] pgmap v3291074: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 819 B/s wr, 0 op/s
2014-12-23 08:24:52.201230 mon.0 10.60.9.11:6789/0 1209179 : [INF] pgmap v3291075: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 819 B/s wr, 0 op/s
2014-12-23 08:24:55.895255 mon.0 10.60.9.11:6789/0 1209180 : [INF] pgmap v3291076: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 560 B/s wr, 0 op/s
2014-12-23 08:24:58.583940 mon.0 10.60.9.11:6789/0 1209181 : [INF] pgmap v3291077: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB  used, 1031 GB / 1632 GB avail; 641 B/s wr, 0 op/s
2014-12-23 08:25:02.206420 mon.0 10.60.9.11:6789/0 1209182 : [INF] pgmap v3291078: 2048 pgs: 2 inactive, 2045 active+clean, 1 remapped+peering; 301 GB data, 600 GB used, 1031 GB / 1632 GB avail; 1297 B/s wr, 0 op/s

Apart from the pgmap updates, here are the most recent other messages:
[root@MON]# grep -v "2 inactive, 2045 active+clean, 1 remapped+peering" /var/log/ceph/ceph.log |tail -5
2014-12-23 06:50:37.237534 osd.1 10.60.9.20:6800/2062 16347 : [WRN] slow request 30720.090953 seconds old, received at 2014-12-22 22:18:37.146491: osd_op(client.5021916.0:64428 rbd_data.1bcae02ae8944a.0000000000000510 [sparse-read 3321344~4096] 5.7e03aeb3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 06:50:37.237541 osd.1 10.60.9.20:6800/2062 16348 : [WRN] slow request 30720.093197 seconds old, received at 2014-12-22 22:18:37.144247: osd_op(client.3324797.0:679739 rbd_data.3fdb9c2ae8944a.000000000000030e [sparse-read 3554816~32768] 5.f2d4a8b3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 07:00:38.469599 osd.1 10.60.9.20:6800/2062 16349 : [WRN] 100 slow requests, 2 included below; oldest blocked for > 54130.968782 secs
2014-12-23 07:00:38.471314 osd.1 10.60.9.20:6800/2062 16350 : [WRN] slow request 30720.750831 seconds old, received at 2014-12-22 22:28:37.718682: osd_op(client.5021916.0:64967 rbd_data.1bcae02ae8944a.0000000000000510 [sparse-read 3329536~4096] 5.7e03aeb3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
2014-12-23 07:00:38.471326 osd.1 10.60.9.20:6800/2062 16351 : [WRN] slow request 30720.750807 seconds old, received at 2014-12-22 22:28:37.718706: osd_op(client.3324797.0:679750 rbd_data.3fdb9c2ae8944a.000000000000030e [sparse-read 2768384~32768] 5.f2d4a8b3 RETRY=1 ack+retry+read e42050) v4 currently reached pg
[root@MON]#


The RBD image for the sampled VM with the hung I/Os looks accessible (but the same command against its parent image hangs):

[root@MON]# date;time rbd info volumes/volume-2e989ca0-b620-42ca-a16f-e218aea32000
Tue Dec 23 08:27:13 GMT 2014
rbd image 'volume-2e989ca0-b620-42ca-a16f-e218aea32000':
size 6144 MB in 768 objects
order 23 (8192 kB objects)
block_name_prefix: rbd_data.412bb450fdfb09
format: 2
features: layering
parent: images/80a2e4e0-0a26-4c00-8783-5530dc914719@snap
overlap: 6144 MB

real 0m0.098s
user 0m0.018s
sys 0m0.009s
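
For completeness, the command against the parent image, which never returns, is:

rbd info images/80a2e4e0-0a26-4c00-8783-5530dc914719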




The CRUSH map:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1

# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node10 {
	id -2		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 0.270
	item osd.5 weight 0.270
}
host node12 {
	id -4		# do not change unnecessarily
	# weight 0.000
	alg straw
	hash 0	# rjenkins1
}
datacenter dc_TWO {
	id -5		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item node10 weight 0.540
	item node12 weight 0.000
}
host node11 {
	id -3		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 0.270
	item osd.3 weight 0.270
}
datacenter dc_ONE {
	id -6		# do not change unnecessarily
	# weight 0.540
	alg straw
	hash 0	# rjenkins1
	item node11 weight 0.540
}
root default {
	id -1		# do not change unnecessarily
	# weight 1.080
	alg straw
	hash 0	# rjenkins1
	item dc_TWO weight 0.540
	item dc_ONE weight 0.540
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
rule DRP {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}

# end crush map



Francois.


