Re: Ceph and its failures

Christian Balzer <chibi@xxxxxxx> · Sun, 21 Feb 2016 15:08:03 +0900

Hello,

On Fri, 19 Feb 2016 21:05:50 +0200 Nmz wrote:

> 
> Hello, happy Ceph users. With this test I want to understand how Ceph
> can protect my data and what I have to do in certain situations. So
> let's begin.
> 
> == Preparation
> 
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
> 
> Ceph contains
>  MON: 3
>  OSD: 3
>
For completeness' sake, the OSDs are on 3 different hosts, right?

> File system: ZFS
That is the odd one out. Very few people that I'm aware of use it, and
support for it is marginal at best.
And some of its features may of course obscure things.

Exact specification, please: how is ZFS configured (single disk,
raid-z, etc.)?

> Kernel: 4.2.6
> 
While probably not related, I vaguely remember 4.3 being recommended for
use with Ceph.

> Preparing pool
> 
> # ceph osd pool create rbd 100
> pool 'rbd' created
> 
> # ceph osd pool set rbd size 3
> set pool 16 size to 3
> 
> RBD client
> 
> # rbd create test --size 4G
> # rbd map test
> /dev/rbd0
> # mkfs.ext2 /dev/rbd0
> # mount /dev/rbd0 /mnt
> # printf "aaaaaaaaaa\nbbbbbbbbbb" > /mnt/file
> 
> 
> Searching PG for our file
> 
> # grep "aaaaaaaaa" * -R
> Binary file osd/nmz-0-journal/journal matches
> Binary file osd/nmz-1/current/16.22_head/rbd\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10 matches
> Binary file osd/nmz-2/current/16.22_head/rbd\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10 matches
> Binary file osd/nmz-1-journal/journal matches
> Binary file osd/nmz-0/current/16.22_head/rbd\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10 matches
> Binary file osd/nmz-2-journal/journal matches
> 
> PG info
> 
> # ceph pg ls      
> pg_stat           16.22
> objects           1
> mip               0
> degr              0
> misp              0
> unf               0
> bytes             8192
> log               2
> disklog           2
> state             active+clean
> state_stamp       2016-02-19 08:46:11.157938
> v                 242'2
> reported          242:14
> up                [2,1,0]
> up_primary        2
> acting            [2,1,0]
> acting_primary    2
> last_scrub        0'0
> scrub_stamp       2016-02-19 08:45:38.006134
> last_deep_scrub   0'0
> deep_scrub_stamp  2016-02-19 08:45:38.006134
> 
> The primary PG is on osd.2. Let's checksum the file
> 
> # md5sum
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \95818f285434d626ab26255410f9a447
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> 
> 
> == Fail imitation #1
> 
> Let's corrupt the backup PGs
> 
> # sed -i -r 's/aaaaaaaaaa/abaaaaaaaa/g'
> osd/nmz-0/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> # sed -i -r 's/aaaaaaaaaa/acaaaaaaaa/g'
> osd/nmz-1/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> 
> # md5sum
> osd/nmz-*/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \99555c6c3ed07550b5fdfd2411b94fdd
> osd/nmz-0/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \8cf7cc66d7f0dc7804fbfef492bcacfd
> osd/nmz-1/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \95818f285434d626ab26255410f9a447
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> 
> Let's do a scrub to find the corruption
>

Scrub only checks that all objects and their metadata are there; it never
looks at the actual object data or compares it.

Only deep-scrub does that.
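
To make the difference concrete, here is a toy model in plain shell. It is
only an illustration under my own assumptions, not how Ceph implements
scrubbing: light_check and deep_check are hypothetical helpers that compare
ordinary files by size and by md5 digest respectively. Your sed edits swap
one character for another, so the size stays the same and only a content
comparison can notice.

```shell
#!/bin/sh
# Toy model only: NOT Ceph's scrub code, just the idea behind it.

# "Light" check: every replica exists and has the same size.
light_check() {
    ref_size=""
    for f in "$@"; do
        [ -f "$f" ] || { echo "missing: $f"; return 1; }
        size=$(wc -c < "$f" | tr -d ' ')
        [ -z "$ref_size" ] && ref_size=$size
        [ "$size" = "$ref_size" ] || { echo "size mismatch: $f"; return 1; }
    done
    echo "light check ok"
}

# "Deep" check: every replica has the same content digest as the first one.
deep_check() {
    ref_sum=""
    for f in "$@"; do
        sum=$(md5sum < "$f" | cut -d' ' -f1)
        [ -z "$ref_sum" ] && ref_sum=$sum
        [ "$sum" = "$ref_sum" ] || { echo "digest mismatch: $f"; return 1; }
    done
    echo "deep check ok"
}
```

Run against three copies of a file where one had a same-length substitution
applied, light_check passes and deep_check names the odd copy.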

> # ceph osd scrub 0
> 
> 7f8732f33700  0 log_channel(cluster) log [INF] : 16.63 scrub starts
> 7f873072e700  0 log_channel(cluster) log [INF] : 16.63 scrub ok
> ....
> 7f8732732700  0 log_channel(cluster) log [INF] : 16.2d scrub starts
> 7f8734f37700  0 log_channel(cluster) log [INF] : 16.2d scrub ok
> 7f8730f2f700  0 log_channel(cluster) log [INF] : 16.2b scrub starts
> 7f8733734700  0 log_channel(cluster) log [INF] : 16.2b scrub ok
> 7f8731730700  0 log_channel(cluster) log [INF] : 16.2a scrub starts
> 7f8733f35700  0 log_channel(cluster) log [INF] : 16.2a scrub ok
> 7f8733f35700  0 log_channel(cluster) log [INF] : 16.25 scrub starts
> 7f8731730700  0 log_channel(cluster) log [INF] : 16.25 scrub ok
> 7f8733f35700  0 log_channel(cluster) log [INF] : 16.20 scrub starts
> 7f8731730700  0 log_channel(cluster) log [INF] : 16.20 scrub ok
> ....
> 7f8734f37700  0 log_channel(cluster) log [INF] : 16.0 scrub ok
> 
> The scrub did not touch PG 16.22. Same with osd.1
> 
That is odd, but note that scrub doesn't necessarily do things
sequentially.

On my clusters the automatic, periodic scrub and deep-scrub certainly go
through all PGs and I kicked them off originally via "ceph osd scrub \*".

> # ceph osd deep-scrub 0
> 
> same results. scrub vs deep-scrub google?
> 
> # ceph pg scrub 16.22
> instructing pg 16.22 on osd.2 to scrub
> 
> Only the primary PG is checked.
> 
Yes, scrub versus deep-scrub.

> So I don't know how to make Ceph check all PGs on an OSD
> 
> 
> == Fail imitation #2
> 
> Let's change the other PG files, so that osd.0 is fine and the others
> are corrupted
> 
> # sed -i -r 's/aaaaaaaaaa/adaaaaaaaa/g'
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10 
> 
> # md5sum
> osd/nmz-*/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \95818f285434d626ab26255410f9a447
> osd/nmz-0/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \8cf7cc66d7f0dc7804fbfef492bcacfd
> osd/nmz-1/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \852a51b44552ffbb2b0350966c9aa3b2
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> 
> 
> # ceph osd scrub 2
> osd.2 instructed to scrub
> 
> 7f5e8b686700  0 log_channel(cluster) log [INF] : 16.22 scrub starts
> 7f5e88e81700  0 log_channel(cluster) log [INF] : 16.22 scrub ok
> 
> No error detection?
> 
As above, a plain scrub never looked at the data.

> # ceph osd deep-scrub 2
> osd.2 instructed to deep-scrub
> 
> 7f5e88e81700  0 log_channel(cluster) log [INF] : 16.22 deep-scrub starts
> 7f5e8b686700  0 log_channel(cluster) log [INF] : 16.22 deep-scrub ok
> 
This should have found it.

> Still no error detection? Let's check the file with md5
> 
> # md5sum
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \852a51b44552ffbb2b0350966c9aa3b2
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> 
> Does the OSD use a cache? Let's restart osd.2
> 
I don't think so. But ZFS may have done something to confuse it, or this
is an untested code path in Ceph in conjunction with ZFS.

> -- After successful restart
> 
> # ceph pg scrub 16.22
> instructing pg 16.22 on osd.2 to scrub
> 
> 7fc475e31700  0 log_channel(cluster) log [INF] : 16.22 scrub starts
> 7fc478636700 -1 log_channel(cluster) log [ERR] : 16.22 shard 2: soid 16/a7e34aa2/rbd_data.1a72a39011461.0000000000000001/head missing attr _, missing attr snapset
> 7fc478636700 -1 log_channel(cluster) log [ERR] : 16.22 scrub 0 missing, 1 inconsistent objects
> 7fc478636700 -1 log_channel(cluster) log [ERR] : 16.22 scrub 1 errors
> 
> # ceph -s
>     cluster 26fdb24b-9004-4e2b-a8d7-c28f45464084
>      health HEALTH_ERR
>             1 pgs inconsistent
>             1 scrub errors
>      monmap e7: 3 mons at {a=10.10.8.1:6789/0,b=10.10.8.1:6790/0,c=10.10.8.1:6791/0}
>             election epoch 60, quorum 0,1,2 a,b,c
>      osdmap e250: 3 osds: 3 up, 3 in
>             flags sortbitwise
>       pgmap v3172: 100 pgs, 1 pools, 143 MB data, 67 objects
>             101 MB used, 81818 MB / 81920 MB avail
>                   99 active+clean
>                    1 active+clean+inconsistent
> 
> No auto heal?
>
Nope.

> # ceph pg repair 16.22
> instructing pg 16.22 on osd.2 to repair
> 
> 7fc475e31700  0 log_channel(cluster) log [INF] : 16.22 repair starts
> 7fc478636700 -1 log_channel(cluster) log [ERR] : 16.22 shard 2: soid 16/a7e34aa2/rbd_data.1a72a39011461.0000000000000001/head data_digest 0xd444e973 != known data_digest 0xb9b5bcf4 from auth shard 0, missing attr _, missing attr snapset
> 7fc478636700 -1 log_channel(cluster) log [ERR] : 16.22 repair 0 missing, 1 inconsistent objects
> 7fc478636700 -1 log_channel(cluster) log [ERR] : 16.22 repair 1 errors, 1 fixed
> 
> Let's do a checksum
> 
> # md5sum
> osd/nmz-*/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \95818f285434d626ab26255410f9a447
> osd/nmz-0/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \8cf7cc66d7f0dc7804fbfef492bcacfd
> osd/nmz-1/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \95818f285434d626ab26255410f9a447
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> 
> The primary PG is fixed but the PG on osd.1 is left unchanged.
>
It (osd.1) wasn't deep-scrubbed, but it probably should have been involved
in the repair. Alas, that obviously isn't the case.

> -- tuning
> 
> Let's change the PG's primary OSD
> 
> # ceph tell mon.*  injectargs -- --mon_osd_allow_primary_temp=true 
> mon.a: injectargs:mon_osd_allow_primary_temp = 'true' 
> mon.b: injectargs:mon_osd_allow_primary_temp = 'true' 
> mon.c: injectargs:mon_osd_allow_primary_temp = 'true'
> 
> # ceph osd primary-temp 16.22 1
> set 16.22 primary_temp mapping to 1
> 
> # ceph osd scrub 1
> osd.1 instructed to scrub
> 
> 7f8a909a2700  0 log_channel(cluster) log [INF] : 16.22 scrub starts
> 7f8a931a7700  0 log_channel(cluster) log [INF] : 16.22 scrub ok
> 
> No detection
> 
> # ceph pg scrub 16.22
> instructing pg 16.22 on osd.1 to scrub
> 
> 7f8a931a7700  0 log_channel(cluster) log [INF] : 16.22 scrub starts
> 7f8a909a2700  0 log_channel(cluster) log [INF] : 16.22 scrub ok
> 
> Still nothing. Let's check md5
> 
> # md5sum
> osd/nmz-*/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \95818f285434d626ab26255410f9a447
> osd/nmz-0/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \8cf7cc66d7f0dc7804fbfef492bcacfd
> osd/nmz-1/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> \852a51b44552ffbb2b0350966c9aa3b2
> osd/nmz-2/current/16.22_head/rbd\\udata.1a72a39011461.0000000000000001__head_A7E34AA2__10
> 
> 
> File is still corrupted. 
> 
> 
> So my questions are:
> 
> 1. How to make a full OSD scrub, not just part of it?
Only primary PGs are compared to their secondaries, so scrubbing an OSD
only covers the PGs it is primary for.
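
If you want every PG that has a copy on a given OSD checked, one option is
to request a deep-scrub per PG rather than per OSD. The following is only a
sketch under assumptions: it assumes your release has "ceph pg ls-by-osd"
and that PG IDs appear in the first column of its output, as they do in
your "ceph pg ls" listing. The CEPH variable is a hypothetical override I
added so the loop can be dry-run with a stub.

```shell
#!/bin/sh
# Extract PG IDs (e.g. 16.22) from the first column of `ceph pg ls`-style
# output on stdin; header and other lines are skipped.
pg_ids() {
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { print $1 }'
}

# Ask for a deep-scrub of every PG that has a copy on the given OSD id.
# CEPH is a hypothetical override for testing; it defaults to the real CLI.
deep_scrub_pgs_on_osd() {
    "${CEPH:-ceph}" pg ls-by-osd "$1" | pg_ids | while read -r pg; do
        # stdin is redirected so the command cannot eat the PG list
        "${CEPH:-ceph}" pg deep-scrub "$pg" < /dev/null
    done
}
```

Usage would be "deep_scrub_pgs_on_osd 0". The scrubs are only queued and run
asynchronously, so watch the cluster log for the results.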

> 2. Why does scrub not detect corrupted files?
It normally does (deep-scrub, that is).

> 3. Does Ceph have an auto heal option?
No.
Nor is the repair function a good idea without checking the data on disk
first.
This is my biggest pet peeve with Ceph and you will find it mentioned
frequently on this ML; just a few days ago in this thread, for example:
"pg repair behavior? (Was: Re: getting rid of misplaced objects)"
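
By "checking the data on disk" I mean something along these lines, run
against the replica files of the object before issuing a repair. It is a
rough sketch, not a Ceph tool; replica_votes is a hypothetical helper name.
It prints how many copies share each digest, so you can see whether a
majority agrees at all.

```shell
#!/bin/sh
# Print "<count> <md5>" for each distinct digest among the given replica
# files, most common digest first.
replica_votes() {
    for f in "$@"; do
        # Reading from stdin avoids the leading "\" that md5sum prints
        # when a file name contains a backslash, as these object files do.
        md5sum < "$f" | cut -d' ' -f1
    done | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

With three replicas, a first line starting with "2" or "3" tells you which
content the majority holds. Three lines starting with "1" (your situation
after the second corruption) mean there is no majority, and you have to
decide by hand which copy is the good one.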

> 4. Does Ceph use some CRC mechanism to detect corrupted bits before
> returning data?
> 
No. This is another thing that really should be addressed within Ceph,
instead of hoping that the underlying FS deals with it (as ZFS and BTRFS
should).
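
Until then, the only protection is whatever you add yourself on top. As a
hedged illustration of the idea (an application-level workaround, not a Ceph
feature), one can keep a digest next to each file and verify it on read.
The helper names and the ".md5" sidecar naming below are my own
assumptions.

```shell
#!/bin/sh
# put_checked <src> <dst>: record the digest of the source, then copy it.
put_checked() {
    md5sum < "$1" | cut -d' ' -f1 > "$2.md5" && cp "$1" "$2"
}

# get_checked <file>: print the content only if it still matches the
# recorded digest; otherwise complain on stderr and return non-zero.
get_checked() {
    want=$(cat "$1.md5") || return 1
    have=$(md5sum < "$1" | cut -d' ' -f1)
    if [ "$want" = "$have" ]; then
        cat "$1"
    else
        echo "checksum mismatch: $1" >&2
        return 1
    fi
}
```

This reads the file twice and uses md5 purely to detect corruption, not for
security, so treat it as a sketch of the principle rather than something to
deploy.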

Regards,

Christian

> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/