Thanks. I'll have to get more
creative. :-)
On 6/14/13 18:19, Gregory Farnum wrote:
Yeah. You've picked up on some warty bits of Ceph's
error handling here for sure, but it's exacerbated by the fact
that you're not simulating what you think. In a real disk error
situation the filesystem would be returning EIO or something, but
here it's returning ENOENT. Since the OSD is authoritative for
that key space and the filesystem says there is no such object,
presto! It doesn't exist.
If you restart the OSD it scans the PGs it has on disk against
what it should have, so it can pick up on the data not being
there and recover. But "correctly" handling data that has been
(from the local FS' perspective) properly deleted under a
running process would require huge and expensive contortions on
the part of the daemon (in any distributed system that I can
think of).
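For example (just a sketch; the exact restart command depends on how the
cluster was deployed, this assumes the sysvinit/service wrapper), bouncing
the OSD and watching the cluster log should let you see it notice the
missing data and pull it back from the other replica:
# restart osd.0 so it rescans its PGs on startup (sysvinit-style deployment assumed)
sudo service ceph restart osd.0
# stream cluster status/log events while it detects the loss and recovers
ceph -w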
-Greg
On Friday, June 14, 2013, Craig Lewis wrote:
So I'm trying to break
my test cluster, and figure out how to put it back together
again. I'm able to fix this, but the behavior seems strange
to me, so I wanted to run it past more experienced people.
I'm doing these tests using RadosGW. I currently have 2
nodes, with replication=2. (I haven't gotten to the cluster
expansion testing yet).
I'm going to upload a file, then simulate a disk failure by
deleting some PGs on one of the OSDs. I have seen this
mentioned as the way to fix OSDs that filled up during
recovery/backfill. I expected the cluster to detect the
error, change the cluster health to HEALTH_WARN, and then return
the data from another copy. Instead, I got a 404 error.
me@client ~ $ s3cmd ls
2013-06-12 00:02  s3://bucket1
me@client ~ $ s3cmd ls s3://bucket1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt
me@client ~ $ s3cmd put Object1 s3://bucket1
Object1 -> s3://bucket1/Object1  [1 of 1]
 400000000 of 400000000   100% in   62s     6.13 MB/s  done
me@client ~ $ s3cmd ls s3://bucket1
2013-06-13 01:10      381M  15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt
So at this point, the cluster is healthy, and we can
download objects from RGW.
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4055: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 17B/s rd, 0op/s
   mdsmap e1: 0/0/1 up
me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download1
s3://bucket1/Object1 -> ./Object.Download1  [1 of 1]
 400000000 of 400000000   100% in   13s    27.63 MB/s  done
Time to simulate a failure. Let's delete all the PGs
used by .rgw.buckets on OSD.0.
me@dev-ceph0:~$ ceph osd tree
# id    weight   type name           up/down  reweight
-1      0.09998  root default
-2      0.04999      host dev-ceph0
0       0.04999          osd.0       up       1
-3      0.04999      host dev-ceph1
1       0.04999          osd.1       up       1
me@dev-ceph0:~$ ceph osd dump | grep .rgw.buckets
pool 9 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 21 owner 18446744073709551615
me@dev-ceph0:~$ cd /var/lib/ceph/osd/ceph-0/current
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ du -sh 9.*
321M    9.0_head
289M    9.1_head
425M    9.2_head
357M    9.3_head
358M    9.4_head
309M    9.5_head
401M    9.6_head
397M    9.7_head
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ sudo rm -rf 9.*
The cluster is still healthy
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up
It probably hasn't noticed the damage yet; there's no
I/O on this test cluster unless I generate it. Let's
retrieve some data, that'll make the cluster notice.
me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
ERROR: S3 error: 404 (Not Found):
me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):
I wasn't expecting that. I expected my object to still
be accessible. Worst case, it should be accessible 50% of
the time. Instead, it's 0% accessible. And the cluster
thinks it's still healthy:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up
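(As an aside, I think ceph osd map would show the acting set of OSDs for a
given object in .rgw.buckets, which is how I'd sanity-check that 50%
expectation. The object name below is just a placeholder; radosgw stripes a
large upload across several RADOS objects, so the real names would come from
rados -p .rgw.buckets ls.)
# placeholder object name; list the real ones with: rados -p .rgw.buckets ls
ceph osd map .rgw.buckets some_rgw_object_name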
Scrubbing the PGs makes the cluster's status reflect the
damage, but it still doesn't let me download:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ for i in `seq 0 7`
> do
> ceph pg scrub 9.$i
> done
instructing pg 9.0 on osd.0 to scrub
instructing pg 9.1 on osd.0 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.3 on osd.0 to scrub
instructing pg 9.4 on osd.0 to scrub
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub
instructing pg 9.7 on osd.0 to scrub
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4105: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 5088 MB used, 97258 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up
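(Side note: if I'm reading the docs right, ceph health detail spells out
exactly which PGs are behind those scrub errors, rather than just counting
them.)
ceph health detail    # lists the individual inconsistent PGs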
And I still can't download my data
me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):
To fix this, I have to scrub the OSD
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph osd scrub 0
osd.0 instructed to scrub
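(ceph -w in a second terminal is a convenient way to follow the scrub, and
the recovery it kicks off, as it happens.)
ceph -w    # stream cluster status and log events while the scrub runs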
This runs for a while, until it reaches the affected PGs.
Then the PGs start recovering:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 2 pgs recovering; 6 pgs recovery_wait; 8 pgs stuck unclean; recovery 988/1534 degraded (64.407%); recovering 2 o/s, 10647KB/s; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4151: 248 pgs: 240 active+clean, 4 active+recovery_wait, 1 active+recovering+inconsistent, 2 active+recovery_wait+inconsistent, 1 active+recovering; 2852 MB data, 5125 MB used
   7KB/s
   mdsmap e1: 0/0/1 up
As soon as the cluster starts recovering, I can access my
object again:
me@client ~ $ s3cmd ls s3://bucket1
2013-06-13 01:10      381M  15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt
me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
 400000000 of 400000000   100% in   92s     4.13 MB/s  done
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 5 pgs recovering; 5 pgs stuck unclean; recovery 228/1534 degraded (14.863%); recovering 2 o/s, 11025KB/s; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4259: 248 pgs: 241 active+clean, 1 active+recovering+inconsistent, 2 active+clean+inconsistent, 4 active+recovering; 2852 MB data, 7428 MB used, 94919 MB / 102347 MB avail; 22
   mdsmap e1: 0/0/1 up
Everything continues to work, but the cluster doesn't
completely heal:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4280: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up
At this point, I have to scrub the inconsistent PGs
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph pg dump | grep inconsistent | cut -f1 | while read pg
> do
> ceph pg scrub $pg
> done
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub
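(For what it's worth, I gather ceph pg repair is the more direct tool for
inconsistent PGs; it scrubs the PG and then tries to repair the bad copies.
I haven't tried it in this scenario, so this is just a sketch of what I'd
expect to run:)
ceph pg dump | grep inconsistent | cut -f1 | while read pg
do
  ceph pg repair $pg    # scrub, then attempt to repair, each inconsistent PG
done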
Everything continues to work, and the cluster has now fully
recovered:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4283: 248 pgs: 248 active+clean; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up
So I'm a bit confused.
Why was the data not accessible between the data loss and
the manual OSD scrub?
What's the effective difference between the PG scrub and the
OSD scrub?
Thanks for the help.
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx