On Tue, 23 Apr 2013, Bryan Stillwell wrote:
> I'm testing this now, but while going through the logs I saw something
> that might have something to do with this:
>
> Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap epoch 22146 off 102 (ffff88021e0dc802 of ffff88021e0dc79c-ffff88021e0dc802)

Oh, that's not right...  What kernel version is this?  Which ceph version?

Thanks-
sage

> Apr 23 16:35:28 a1 kernel: [692455.505154] osdmap: 00000000: 05 00 69 17 a0 33 34 39 4f d7 88 db 46 c9 e1 df  ..i..349O...F...
> Apr 23 16:35:28 a1 kernel: [692455.505158] osdmap: 00000010: 0d 6e 82 56 00 00 b0 0c 77 51 00 1a 00 22 ff ff  .n.V....wQ..."..
> Apr 23 16:35:28 a1 kernel: [692455.505161] osdmap: 00000020: ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ff ff  ................
> Apr 23 16:35:28 a1 kernel: [692455.505163] osdmap: 00000030: ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Apr 23 16:35:28 a1 kernel: [692455.505166] osdmap: 00000040: 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ff ff  ................
> Apr 23 16:35:28 a1 kernel: [692455.505169] osdmap: 00000050: 5c 02 00 00 00 00 03 00 00 00 0c 00 00 00 00 00  \...............
> Apr 23 16:35:28 a1 kernel: [692455.505171] osdmap: 00000060: 00 00 02 00 00 00                                ......
> Apr 23 16:35:28 a1 kernel: [692455.505174] libceph: osdc handle_map corrupt msg
> Apr 23 16:35:28 a1 kernel: [692455.513590] header: 00000000: 90 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Apr 23 16:35:28 a1 kernel: [692455.513593] header: 00000010: 29 00 c4 00 01 00 86 00 00 00 00 00 00 00 00 00  )...............
> Apr 23 16:35:28 a1 kernel: [692455.513596] header: 00000020: 00 00 00 00 01 00 00 00 00 00 00 00 00 01 00 00  ................
> Apr 23 16:35:28 a1 kernel: [692455.513599] header: 00000030: 00 5d 68 c5 e8                                   .]h..
> Apr 23 16:35:28 a1 kernel: [692455.513602] front: 00000000: 69 17 a0 33 34 39 4f d7 88 db 46 c9 e1 df 0d 6e  i..349O...F....n
> Apr 23 16:35:28 a1 kernel: [692455.513605] front: 00000010: 01 00 00 00 82 56 00 00 66 00 00 00 05 00 69 17  .....V..f.....i.
> Apr 23 16:35:28 a1 kernel: [692455.513607] front: 00000020: a0 33 34 39 4f d7 88 db 46 c9 e1 df 0d 6e 82 56  .349O...F....n.V
> Apr 23 16:35:28 a1 kernel: [692455.513610] front: 00000030: 00 00 b0 0c 77 51 00 1a 00 22 ff ff ff ff ff ff  ....wQ..."......
> Apr 23 16:35:28 a1 kernel: [692455.513613] front: 00000040: ff ff 00 00 00 00 00 00 00 00 ff ff ff ff 00 00  ................
> Apr 23 16:35:28 a1 kernel: [692455.513616] front: 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Apr 23 16:35:28 a1 kernel: [692455.513618] front: 00000060: 00 00 00 00 00 00 01 00 00 00 ff ff 5c 02 00 00  ............\...
> Apr 23 16:35:28 a1 kernel: [692455.513621] front: 00000070: 00 00 03 00 00 00 0c 00 00 00 00 00 00 00 02 00  ................
> Apr 23 16:35:28 a1 kernel: [692455.513624] front: 00000080: 00 00 00 00 00 00                                ......
> Apr 23 16:35:28 a1 kernel: [692455.513627] footer: 00000000: ae ee 1e d8 00 00 00 00 00 00 00 00 01           .............
>
> On Tue, Apr 23, 2013 at 4:41 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> > On Tue, Apr 23, 2013 at 3:37 PM, Bryan Stillwell
> > <bstillwell@xxxxxxxxxxxxxxx> wrote:
> >> I'm using the kernel client that's built into precise & quantal.
> >>
> >> I could give the ceph-fuse client a try and see if it has the same
> >> issue.  I haven't used it before, so I'll have to do some reading
> >> first.
> >
> > If you've got the time that would be a good data point, and make
> > debugging easier if it reproduces. There's not a ton to learn -- you
> > install the ceph-fuse package (I think it's packaged separately,
> > anyway) and then instead of "mount" you run "ceph-fuse -c <ceph.conf
> > file> --name client.<name> --keyring <keyring_file>" or similar. :)
> > -Greg
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
> >
> >>
> >> Bryan
> >>
> >> On Tue, Apr 23, 2013 at 4:04 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> >>> Sorry, I meant kernel client or ceph-fuse? Client logs would be enough
> >>> to start with, I suppose -- "debug client = 20" and "debug ms = 1" if
> >>> using ceph-fuse; if using the kernel client things get trickier; I'd
> >>> have to look at what logging is available without the debugfs stuff
> >>> being enabled. :/
> >>> -Greg
> >>> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>>
> >>>
> >>> On Tue, Apr 23, 2013 at 3:00 PM, Bryan Stillwell
> >>> <bstillwell@xxxxxxxxxxxxxxx> wrote:
> >>>> I've tried a few different ones:
> >>>>
> >>>> 1. cp to cephfs mounted filesystem on Ubuntu 12.10 (quantal)
> >>>> 2. rsync over ssh to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
> >>>> 3. scp to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
> >>>>
> >>>> It's fairly reproducible, so I can collect logs for you.  Which ones
> >>>> would you be interested in?
> >>>>
> >>>> The cluster has been in a couple states during testing (during
> >>>> expansion/rebalancing and during an all active+clean state).
> >>>>
> >>>> BTW, all the nodes are running with the 0.56.4-1precise packages.
> >>>>
> >>>> Bryan
> >>>>
> >>>> On Tue, Apr 23, 2013 at 12:56 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> >>>>> On Tue, Apr 23, 2013 at 11:38 AM, Bryan Stillwell
> >>>>> <bstillwell@xxxxxxxxxxxxxxx> wrote:
> >>>>>> I've run into an issue where after copying a file to my cephfs cluster
> >>>>>> the md5sums no longer match.  I believe I've tracked it down to some
> >>>>>> parts of the file which are missing:
> >>>>>>
> >>>>>> $ obj_name=$(cephfs "title1.mkv" show_location -l 0 | grep object_name | sed -e "s/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/")
> >>>>>> $ echo "Object name: $obj_name"
> >>>>>> Object name: 10000001120
> >>>>>>
> >>>>>> $ file_size=$(stat "title1.mkv" | grep Size | awk '{ print $2 }')
> >>>>>> $ printf "File size: %d MiB (%d Bytes)\n" $(($file_size/1048576)) $file_size
> >>>>>> File size: 20074 MiB (21049178117 Bytes)
> >>>>>>
> >>>>>> $ blocks=$((file_size/4194304+1))
> >>>>>> $ printf "Blocks: %d\n" $blocks
> >>>>>> Blocks: 5019
> >>>>>>
> >>>>>> $ for b in `seq 0 $(($blocks-1))`; do rados -p data stat ${obj_name}.`printf '%8.8x\n' $b` | grep "error"; done
> >>>>>> error stat-ing data/10000001120.00001076: No such file or directory
> >>>>>> error stat-ing data/10000001120.000011c7: No such file or directory
> >>>>>> error stat-ing data/10000001120.0000129c: No such file or directory
> >>>>>> error stat-ing data/10000001120.000012f4: No such file or directory
> >>>>>> error stat-ing data/10000001120.00001307: No such file or directory
> >>>>>>
> >>>>>> Any ideas where to look to investigate what caused these blocks to not
> >>>>>> be written?
> >>>>>
> >>>>> What client are you using to write this? Is it fairly reproducible (so
> >>>>> you could collect logs of it happening)?
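
For reference, a rough sketch of what the ceph-fuse mount and logging Greg
describes above might look like; the config path, keyring path, client name,
log location, and mount point here are placeholders, not values taken from
this cluster:

    $ sudo mkdir -p /mnt/cephfs
    $ sudo ceph-fuse -c /etc/ceph/ceph.conf --name client.admin \
          --keyring /etc/ceph/ceph.client.admin.keyring /mnt/cephfs

    # in ceph.conf, to capture the client-side debug logs mentioned above:
    [client]
        debug client = 20
        debug ms = 1
        log file = /var/log/ceph/$name.log

Once the test copy is done, the mount can be torn down with a normal
"umount /mnt/cephfs" (or "fusermount -u /mnt/cephfs").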
> >>>>>
> >>>>> Usually the only times I've seen anything like this were when either
> >>>>> the file data was supposed to go into a pool which the client didn't
> >>>>> have write permissions on, or when the RADOS cluster was in bad shape
> >>>>> and so the data never got flushed to disk. Has your cluster been
> >>>>> healthy since you started writing the file out?
> >>>>> -Greg
> >>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Here's the current state of the cluster:
> >>>>>>
> >>>>>> ceph -s
> >>>>>>    health HEALTH_OK
> >>>>>>    monmap e1: 1 mons at {a=172.24.88.50:6789/0}, election epoch 1, quorum 0 a
> >>>>>>    osdmap e22059: 24 osds: 24 up, 24 in
> >>>>>>    pgmap v1783615: 1920 pgs: 1917 active+clean, 3 active+clean+scrubbing+deep; 4667 GB data, 9381 GB used, 4210 GB / 13592 GB avail
> >>>>>>    mdsmap e437: 1/1/1 up {0=a=up:active}
> >>>>>>
> >>>>>> Here's my current crushmap:
> >>>>>>
> >>>>>> # begin crush map
> >>>>>>
> >>>>>> # devices
> >>>>>> device 0 osd.0
> >>>>>> device 1 osd.1
> >>>>>> device 2 osd.2
> >>>>>> device 3 osd.3
> >>>>>> device 4 osd.4
> >>>>>> device 5 osd.5
> >>>>>> device 6 osd.6
> >>>>>> device 7 osd.7
> >>>>>> device 8 osd.8
> >>>>>> device 9 osd.9
> >>>>>> device 10 osd.10
> >>>>>> device 11 osd.11
> >>>>>> device 12 osd.12
> >>>>>> device 13 osd.13
> >>>>>> device 14 osd.14
> >>>>>> device 15 osd.15
> >>>>>> device 16 osd.16
> >>>>>> device 17 osd.17
> >>>>>> device 18 osd.18
> >>>>>> device 19 osd.19
> >>>>>> device 20 osd.20
> >>>>>> device 21 osd.21
> >>>>>> device 22 osd.22
> >>>>>> device 23 osd.23
> >>>>>>
> >>>>>> # types
> >>>>>> type 0 osd
> >>>>>> type 1 host
> >>>>>> type 2 rack
> >>>>>> type 3 row
> >>>>>> type 4 room
> >>>>>> type 5 datacenter
> >>>>>> type 6 pool
> >>>>>>
> >>>>>> # buckets
> >>>>>> host b1 {
> >>>>>>         id -2           # do not change unnecessarily
> >>>>>>         # weight 2.980
> >>>>>>         alg straw
> >>>>>>         hash 0  # rjenkins1
> >>>>>>         item osd.0 weight 0.500
> >>>>>>         item osd.1 weight 0.500
> >>>>>>         item osd.2 weight 0.500
> >>>>>>         item osd.3 weight 0.500
> >>>>>>         item osd.4 weight 0.500
> >>>>>>         item osd.20 weight 0.480
> >>>>>> }
> >>>>>> host b2 {
> >>>>>>         id -4           # do not change unnecessarily
> >>>>>>         # weight 4.680
> >>>>>>         alg straw
> >>>>>>         hash 0  # rjenkins1
> >>>>>>         item osd.5 weight 0.500
> >>>>>>         item osd.6 weight 0.500
> >>>>>>         item osd.7 weight 2.200
> >>>>>>         item osd.8 weight 0.500
> >>>>>>         item osd.9 weight 0.500
> >>>>>>         item osd.21 weight 0.480
> >>>>>> }
> >>>>>> host b3 {
> >>>>>>         id -5           # do not change unnecessarily
> >>>>>>         # weight 3.480
> >>>>>>         alg straw
> >>>>>>         hash 0  # rjenkins1
> >>>>>>         item osd.10 weight 0.500
> >>>>>>         item osd.11 weight 0.500
> >>>>>>         item osd.12 weight 1.000
> >>>>>>         item osd.13 weight 0.500
> >>>>>>         item osd.14 weight 0.500
> >>>>>>         item osd.22 weight 0.480
> >>>>>> }
> >>>>>> host b4 {
> >>>>>>         id -6           # do not change unnecessarily
> >>>>>>         # weight 3.480
> >>>>>>         alg straw
> >>>>>>         hash 0  # rjenkins1
> >>>>>>         item osd.15 weight 0.500
> >>>>>>         item osd.16 weight 1.000
> >>>>>>         item osd.17 weight 0.500
> >>>>>>         item osd.18 weight 0.500
> >>>>>>         item osd.19 weight 0.500
> >>>>>>         item osd.23 weight 0.480
> >>>>>> }
> >>>>>> pool default {
> >>>>>>         id -1           # do not change unnecessarily
> >>>>>>         # weight 14.620
> >>>>>>         alg straw
> >>>>>>         hash 0  # rjenkins1
> >>>>>>         item b1 weight 2.980
> >>>>>>         item b2 weight 4.680
> >>>>>>         item b3 weight 3.480
> >>>>>>         item b4 weight 3.480
> >>>>>> }
> >>>>>>
> >>>>>> # rules
> >>>>>> rule data {
> >>>>>>         ruleset 0
> >>>>>>         type replicated
> >>>>>>         min_size 2
> >>>>>>         max_size 10
> >>>>>>         step take default
> >>>>>>         step chooseleaf firstn 0 type host
> >>>>>>         step emit
> >>>>>> }
> >>>>>> rule metadata {
> >>>>>>         ruleset 1
> >>>>>>         type replicated
> >>>>>>         min_size 2
> >>>>>>         max_size 10
> >>>>>>         step take default
> >>>>>>         step chooseleaf firstn 0 type host
> >>>>>>         step emit
> >>>>>> }
> >>>>>> rule rbd {
> >>>>>>         ruleset 2
> >>>>>>         type replicated
> >>>>>>         min_size 1
> >>>>>>         max_size 10
> >>>>>>         step take default
> >>>>>>         step chooseleaf firstn 0 type host
> >>>>>>         step emit
> >>>>>> }
> >>>>>>
> >>>>>> # end crush map
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Bryan
> >>
> >>
> >> --
> >>
> >> Bryan Stillwell
> >> SENIOR SYSTEM ADMINISTRATOR
> >>
> >> E: bstillwell@xxxxxxxxxxxxxxx
> >> O: 303.228.5109
> >> M: 970.310.6085
> >
>
> --
>
> Bryan Stillwell
> SENIOR SYSTEM ADMINISTRATOR
>
> E: bstillwell@xxxxxxxxxxxxxxx
> O: 303.228.5109
> M: 970.310.6085
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
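
The object-by-object check Bryan posted earlier in the thread can be rolled
into a small script for reuse. A rough sketch, assuming the default 4 MiB
object size and the "data" pool; the grep/sed patterns are lifted from
Bryan's commands, while the wrapper itself (and the use of "stat -c %s" for
the file size) is just one way to glue them together:

    #!/bin/bash
    # Print the RADOS objects backing a cephfs file that are missing.
    # Usage: ./check_missing_objects.sh <file>
    file="$1"

    # Object name prefix for the file (derived from its inode number).
    obj_name=$(cephfs "$file" show_location -l 0 | grep object_name |
               sed -e "s/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/")

    # Number of 4 MiB objects needed to cover the whole file.
    file_size=$(stat -c %s "$file")
    blocks=$((file_size / 4194304 + 1))

    # Stat every expected object and print only the ones that don't exist.
    for b in $(seq 0 $((blocks - 1))); do
        rados -p data stat "${obj_name}.$(printf '%8.8x' "$b")" 2>&1 |
            grep "error"
    done

Any lines it prints correspond to 4 MiB regions of the file that were never
written out, which is the symptom seen with title1.mkv above.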