So I'm trying to break my test cluster, and figure out how to put it
back together again. I'm able to fix this, but the behavior seems
strange to me, so I wanted to run it past more experienced people.

I'm doing these tests using RadosGW. I currently have 2 nodes, with replication=2. (I haven't gotten to the cluster expansion testing yet.) I'm going to upload a file, then simulate a disk failure by deleting some PGs on one of the OSDs. I have seen this mentioned as the way to fix OSDs that filled up during recovery/backfill.

I expected the cluster to detect the error, change the cluster health to warn, then return the data from another copy. Instead, I got a 404 error.

me@client ~ $ s3cmd ls
2013-06-12 00:02  s3://bucket1

me@client ~ $ s3cmd ls s3://bucket1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt

me@client ~ $ s3cmd put Object1 s3://bucket1
Object1 -> s3://bucket1/Object1  [1 of 1]
 400000000 of 400000000   100% in   62s     6.13 MB/s  done

me@client ~ $ s3cmd ls s3://bucket1
2013-06-13 01:10      381M  15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt

So at this point, the cluster is healthy, and we can download objects from RGW.

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
    pgmap v4055: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 17B/s rd, 0op/s
   mdsmap e1: 0/0/1 up

me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download1
s3://bucket1/Object1 -> ./Object.Download1  [1 of 1]
 400000000 of 400000000   100% in   13s    27.63 MB/s  done

Time to simulate a failure. Let's delete all the PGs used by .rgw.buckets on OSD.0.

me@dev-ceph0:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      0.09998 root default
-2      0.04999         host dev-ceph0
0       0.04999                 osd.0   up      1
-3      0.04999         host dev-ceph1
1       0.04999                 osd.1   up      1

me@dev-ceph0:~$ ceph osd dump | grep .rgw.buckets
pool 9 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 21 owner 18446744073709551615

me@dev-ceph0:~$ cd /var/lib/ceph/osd/ceph-0/current
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ du -sh 9.*
321M    9.0_head
289M    9.1_head
425M    9.2_head
357M    9.3_head
358M    9.4_head
309M    9.5_head
401M    9.6_head
397M    9.7_head

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ sudo rm -rf 9.*

The cluster is still healthy:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
    pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up

It probably hasn't noticed the damage yet; there's no I/O on this test cluster unless I generate it. Let's retrieve some data, that'll make the cluster notice.

me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
ERROR: S3 error: 404 (Not Found):

me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):

I wasn't expecting that. I expected my object to still be accessible. Worst case, it should be accessible 50% of the time. Instead, it's 0% accessible.
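Next time I run this test I'm planning to check which PGs and OSDs the object's RADOS pieces actually map to, before and after deleting the directories, to see whether the failed reads are all going to the OSD whose copies I removed. A rough sketch of what I'd run; the RADOS object name is a guess (RGW prefixes objects with the bucket marker), so treat "<marker>_Object1" as a placeholder:

me@dev-ceph0:~$ rados -p .rgw.buckets ls | grep Object1      # find the real RADOS name(s) behind the RGW object
me@dev-ceph0:~$ ceph osd map .rgw.buckets <marker>_Object1   # which PG it hashes to, and which OSDs are acting/primary
me@dev-ceph0:~$ ceph pg 9.0 query                            # per-PG state as the OSDs see it

My guess is that reads are only served by each PG's primary OSD, which would explain 0% rather than 50%, but I'd appreciate confirmation.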
And the cluster thinks it's still healthy:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
    pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up

Scrubbing the PGs corrects the cluster's status, but still doesn't let me download:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ for i in `seq 0 7`
> do
> ceph pg scrub 9.$i
> done
instructing pg 9.0 on osd.0 to scrub
instructing pg 9.1 on osd.0 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.3 on osd.0 to scrub
instructing pg 9.4 on osd.0 to scrub
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub
instructing pg 9.7 on osd.0 to scrub

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
    pgmap v4105: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 5088 MB used, 97258 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up

And I still can't download my data:

me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):

To fix this, I have to scrub the OSD:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph osd scrub 0
osd.0 instructed to scrub

This runs for a while, until it reaches the affected PGs. Then the PGs are recovering:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 2 pgs recovering; 6 pgs recovery_wait; 8 pgs stuck unclean; recovery 988/1534 degraded (64.407%); recovering 2 o/s, 10647KB/s; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
    pgmap v4151: 248 pgs: 240 active+clean, 4 active+recovery_wait, 1 active+recovering+inconsistent, 2 active+recovery_wait+inconsistent, 1 active+recovering; 2852 MB data, 5125 MB used 7KB/s
   mdsmap e1: 0/0/1 up

As soon as the cluster starts recovering, I can access my object again:

me@client ~ $ s3cmd ls s3://bucket1
2013-06-13 01:10      381M  15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt

me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
 400000000 of 400000000   100% in   92s     4.13 MB/s  done

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 5 pgs recovering; 5 pgs stuck unclean; recovery 228/1534 degraded (14.863%); recovering 2 o/s, 11025KB/s; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
    pgmap v4259: 248 pgs: 241 active+clean, 1 active+recovering+inconsistent, 2 active+clean+inconsistent, 4 active+recovering; 2852 MB data, 7428 MB used, 94919 MB / 102347 MB avail; 22
   mdsmap e1: 0/0/1 up
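Side note: while this was running I was just re-polling ceph status by hand. Something like the following would probably be a tidier way to watch the recovery and see exactly which PGs are still unhappy; I haven't double-checked the exact output formats, so take it as a sketch:

me@dev-ceph0:~$ ceph -w                        # stream cluster state changes as they happen
me@dev-ceph0:~$ ceph health detail             # list the individual inconsistent/degraded PGs
me@dev-ceph0:~$ ceph pg dump_stuck unclean     # just the PGs that aren't active+clean yet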
Everything continues to work, but the cluster doesn't completely heal:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
    pgmap v4280: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up

At this point, I have to scrub the inconsistent PGs:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph pg dump | grep inconsistent | cut -f1 | while read pg
> do
> ceph pg scrub $pg
> done
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub

Everything continues to work, until the cluster has fully recovered:

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
    pgmap v4283: 248 pgs: 248 active+clean; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up

So I'm a bit confused. Why was the data not accessible between the data loss and the manual OSD scrub? What's the effective difference between the PG scrub and the OSD scrub?

Thanks for the help.
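P.S. One thing I haven't tried yet is 'ceph pg repair' on the inconsistent PGs instead of re-scrubbing them. Next round I was planning to drive the cleanup from ceph health detail rather than ceph pg dump, roughly like the loop below; it's untested, so the awk field may need adjusting for the actual output:

me@dev-ceph0:~$ ceph health detail | awk '/active\+clean\+inconsistent/ {print $2}' | while read pg
> do
> ceph pg repair $pg
> done

If repair is the intended tool here rather than another scrub, that would be good to know too.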
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com