Re: Random data corruption in VM, possibly caused by rbd

On Monday, 11 June 2012, 09:30:42, Sage Weil wrote:
> If you can reproduce it with 'debug filestore = 20' too, that will be
> better, as it will tell us what the FIEMAP ioctl is returning.

I ran another test run with 'debug filestore = 20'.
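
(For anyone reproducing this: the option goes into the [osd] section of
ceph.conf on the OSD host and takes effect after restarting that OSD,
along these lines:)

[osd]
        debug filestore = 20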

> Also, if
> you can attach/post the contents of the object itself (rados -p rbd get
> rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right
> data (and the sparse-read operation that librbd is doing is the culprit).

I tried that, using the block name that the steps further below gave me:

rados -p rbd get rb.0.13.00000000045a block

When I looked into the block, it contained what looked like a bunch of temp
files from the portage system with padding in between, although it should
contain random data... I think I got the wrong block after all...
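
Nothing rbd-specific is needed to peek inside the downloaded object, by the
way; generic tools along these lines are enough:

hexdump -C block | less      # raw view of the object contents
strings block | head -n 40   # readable text fragments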

Here's what I did:
Run the iotester again:
testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
Tue Jun 12 13:51:58 CEST 2012
Wrote 100 MiB of data in 2004 milliseconds
[snip lots of irrelevant lines]
Wrote 100 MiB of data in 2537 milliseconds
Read 100 MiB of data in 3794 milliseconds
Read 100 MiB of data in 10150 milliseconds
Digest wrong for file "/var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e"
Tue Jun 12 13:55:00 CEST 2012
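
For context: the file names look like SHA-1 digests of the contents, so the
failing check boils down to the usual write-random-data-and-verify-the-digest
pattern. A rough shell equivalent of that per-file idea (not the actual
iotester code) would be:

# write: a random blob, named after its digest
dd if=/dev/urandom of=/tmp/blob bs=1M count=1 2>/dev/null
sum=$(sha1sum /tmp/blob | awk '{print $1}')
mv /tmp/blob /var/iotest/$sum
# read back and verify; a mismatch is what produces the "Digest wrong" message
echo "$sum  /var/iotest/$sum" | sha1sum -c -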

Run the fiemap tool on that file:

testserver-rbd11 ~ # ./fiemap /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e
File /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e has 1 extents:
#       Logical          Physical         Length           Flags
0:      0000000000000000 0000000116900000 0000000000100000 0001
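
In case someone wants to double-check without the fiemap test tool,
filefrag -v from e2fsprogs prints essentially the same extent mapping (it
uses FIEMAP where available), though it reports offsets in filesystem blocks
rather than bytes:

filefrag -v /var/iotest/4299a48eca63c75d6773bec3565190aa3b33c46e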

> As for the log:
> 
> First, map the offset to an rbd block.  For example, taking the 'Physical'
> value of 00000000a8200000 from above:
> 
> $ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) ))
> 0000000002a0

That gave me

 $ printf "%012x\n" $((0x0000000116900000 / (4096*1024) ))
00000000045a
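
(4096*1024 is the default 4 MiB rbd object size. As a sanity check:
0x116900000 is 4457 MiB into the device, and 4457 / 4 rounds down to
object 1114, which is 0x45a in hex, so the result looks consistent.)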
 
> Then figure out what the object name prefix is:
> 
> $ rbd info <imagename> | grep prefix
>         block_name_prefix: rb.0.1

Result: block_name_prefix: rb.0.13

> Then add the block number, 0000000002a0 to that, e.g. rb.0.1.0000000002a0.

Result: rb.0.13.00000000045a

> Then map that back to an osd with
> 
> $ ceph osd map rbd rb.0.1.0000000002a0
> osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65
> (2.5) -> up [0,2] acting [0,2]

That gives me
[root@storage1 ~]# ceph osd map rbd rb.0.13.00000000045a 2> /dev/null 
osdmap e101 pool 'rbd' (2) object 'rb.0.13.00000000045a' -> pg 2.80b039fb (2.7b) -> up [2,1] acting [2,1]
 
> You'll see the osd ids listed in brackets after 'acting'.  We want the
> first one, 0 in my example.  The log from that OSD is what we need.

Okay, I'm attaching the compressed log for osd.2 and the compressed block to
the issue report in Redmine.
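
In case it saves someone time, the relevant lines can be pulled out of the
OSD log with a plain grep on the object name, e.g.

grep 'rb.0.13.00000000045a' /var/log/ceph/osd.2.log

(log path as appropriate for your setup).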

	Guido

