Re: clone_range in BlueStore

On Mon, 30 Jan 2017, Igor Fedotov wrote:
> 
> On 30.01.2017 17:18, Sage Weil wrote:
> > On Mon, 30 Jan 2017, Igor Fedotov wrote:
> > > Hi Sage,
> > > 
> > > It looks like there is some bug somewhere in
> > > BlueStore/store_test/clone_range.
> > > 
> > > I'm occasionally hitting an assert on mismatched data in read result while
> > > performing SyntheticMatrixCsumVsCompression/2 test case.
> > > 
> > > --- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
> > > --- expected:
> > > 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > > 
> > > 00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39
> > > |9517271441891379|
> > > 
> > > <skipped>
> > > 
> > > 00007400  30 31 33 32 34 39 35 35  30 38 32 37 39 32 37 31
> > > |0132495508279271|
> > > 
> > > 00007410  37 37 37 31 31 38 31 37  33 36 32 36 33 33 31 34
> > > |7771181736263314|
> > > 
> > > --- actual:
> > > 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > > 00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39
> > > |9517271441891379|
> > > 
> > > <skipped>
> > > 
> > > 00007400  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > > 0000a200  32 35 32 32 38 33 31 34  35 38 37 36 34 35 36 33
> > > |2522831458764563|
> > > 
> > > Multiple runs are required to hit that though...
> > > 
> > > I did some analysis and it seems that there are some issues with
> > > clone_range2
> > > stuff.
> > > 
> > > First of all - do we have any limits or prerequisites on src/dst offsets
> > > in this request? E.g. should they be aligned similarly within alloc unit
> > > boundaries? I recall some discussions on that a while ago.
> > > 
> > > store_test doesn't have any as far as I can see, e.g. (min_alloc_size =
> > > 0x10000)
> > > 
> > >   "ops": [
> > >          {
> > >              "op_num": 0,
> > >              "op_name": "clonerange2",
> > >              "collection": "555.0_head",
> > >              "src_oid":
> > > "#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> > > aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> > > aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
> > >              "dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
> > >              "src_offset": 107520,
> > >              "len": 78336,
> > >              "dst_offset": 27648
> > >          }
> > >      ]
> > > 
> > > This results in potentially invalid blobs for the destination objects, see
> > > extent starting at 0x7400 below - it has blob offset = 0 and hence blob
> > > isn't
> > > aligned with min_alloc_size:
> > Oh, right.
> > 
> > Well, the good news is the OSD no longer has any callers for which the src
> > and dst clone_range offsets are different, so we could simply assert that
> > they match.  That's the simplest fix.  It's partly a question of whether
> > we expect future cases where we will need to clone between offsets.
> > Perhaps we assert for now but don't clean up the interface in case
> > we need to backtrack later?
> > 
> > Or we could do something more limited.  The problem below is less about
> > min_alloc_size and more that it's not block aligned, I think, right?  We
> Perhaps you're right and that's rather about block aligned blobs/extents. But
> I'm a bit worried about having AU-unaligned blobs. IMO we don't test such
> cases much. I think it's better to produce extents/blobs similarly to the
> mainstream write  path, i.e. AU-alignment. Other approaches are more
> error-prone and much harder to catch due to their rarity. On the other hand we
> have the ability to modify the AU size on the fly, and hence we violate the
> AU-alignment requirement this way too....

I think we should either adjust the store_test synthetic thing to adjust 
min_alloc_size randomly, or drop the ability to change it at all.  Leaning 
toward the latter.. let's discuss during standup.

> >   could make clone_range fall back to the read/write path if the alignment
> > does not match the block device...
> Yeah, that makes sense. Especially we do R/W for unaligned head/tail only..
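
For illustration, the fallback decision could look roughly like the sketch
below. This is not BlueStore code: the function names and the 0x1000 block
size are assumptions (0x1000 is only the csum chunk size visible in the log),
and Python is used purely for brevity.

```python
# Hypothetical sketch: decide whether a clone_range can share blobs or
# should fall back to a plain read/write copy.
BLOCK_SIZE = 0x1000  # assumed device block size, not taken from BlueStore


def same_block_alignment(src_offset: int, dst_offset: int,
                         block_size: int = BLOCK_SIZE) -> bool:
    """True when src and dst sit at the same position within a block."""
    return src_offset % block_size == dst_offset % block_size


def clone_range_strategy(src_offset: int, dst_offset: int,
                         block_size: int = BLOCK_SIZE) -> str:
    """Return "share" when blobs can be reused, else "read_write"."""
    if same_block_alignment(src_offset, dst_offset, block_size):
        return "share"
    return "read_write"


# Offsets from the failing clonerange2 op in this thread.
print(clone_range_strategy(107520, 27648))
```

With the offsets from the failing op (src 0x1a400, dst 0x6c00) the in-block
positions are 0x400 and 0xc00, so this sketch would take the read/write path.
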
> 
> Actually my major concern is a broken store_test for now. Should we force
> aligned-only offsets there at the moment?

Yeah, let's do that!
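
A minimal sketch of what forcing aligned-only offsets in the test generator
could look like. The helper names are hypothetical and this is not the
store_test code; it only assumes min_alloc_size = 0x10000 as quoted above and
rounds both randomly chosen offsets down to that boundary, leaving the length
arbitrary.

```python
import random

MIN_ALLOC_SIZE = 0x10000  # value quoted in this thread


def align_down(value: int, alignment: int) -> int:
    """Round value down to the nearest multiple of alignment."""
    return value - (value % alignment)


def random_aligned_clone_range(rng: random.Random, max_offset: int,
                               max_len: int,
                               alignment: int = MIN_ALLOC_SIZE):
    """Pick (src_offset, dst_offset, length) with both offsets aligned."""
    src = align_down(rng.randrange(max_offset + 1), alignment)
    dst = align_down(rng.randrange(max_offset + 1), alignment)
    length = rng.randrange(1, max_len + 1)  # length may stay unaligned
    return src, dst, length


rng = random.Random(0)
src, dst, length = random_aligned_clone_range(rng, 2 * 1024 * 1024, 0x20000)
print(hex(src), hex(dst), hex(length))
```

Applied to the failing op, align_down would turn src_offset 107520 into
0x10000 and dst_offset 27648 into 0.
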

sage


> > 
> > sage
> > 
> > > 2017-01-30 03:57:17.802440 7f0036a20700 15 bluestore(bluestore.test_temp_dir) read 555.0_head #555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e00
> > > 2017-01-30 03:57:17.802448 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78 in 0x55eb45dd0620) lookup
> > > 2017-01-30 03:57:17.802450 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78 in 0x55eb45dd0620) lookup #555:c7000000:::OBJ_738:7cfc81ab# hit 0x55eb49874700
> > > 2017-01-30 03:57:17.802453 7f0036a20700 20 bluestore(bluestore.test_temp_dir) _do_read 0x0~19e00 size 0x19e00 (105984)
> > > 2017-01-30 03:57:17.802455 7f0036a20700 20 bluestore.onode(0x55eb49874700) flush done
> > > 2017-01-30 03:57:17.802456 7f0036a20700 30 bluestore.extentmap(0x55eb49874850) fault_range 0x0~19e00
> > > 2017-01-30 03:57:17.802457 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _dump_onode 0x55eb49874700 #555:c7000000:::OBJ_738:7cfc81ab# nid 17377 size 0x19e00 (105984) expected_object_size 2097152 expected_write_size 4096 in 0 shards
> > > 2017-01-30 03:57:17.802461 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _dump_extent_map  0x6c00~800: 0x3800~800 Blob(0x55eb5047a460 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map(0x40190000~4000=2))))
> > > 2017-01-30 03:57:17.802469 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _dump_extent_map      csum: [0,0,f1e4ed4a,417bbe91]
> > > 2017-01-30 03:57:17.802472 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _dump_extent_map  0x7400~12a00: 0x0~12a00 Blob(0x55eb4dc87b80 blob([0x40194000~18000] csum+shared crc32c/0x1000) ref_map(0x0~12a00=1) SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded shared_blob(ref_map(0x40194000~18000=3))))
> > > 2017-01-30 03:57:17.802479 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _dump_extent_map      csum: [d1f849c5,fbe516b8,518379f8,b8b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac03032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
> > > 2017-01-30 03:57:17.802484 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _dump_extent_map 0x0~18000 buffer(0x55eb45deb020 space 0x55eb4c2f51d8 0x0~18000 clean)
> > > 2017-01-30 03:57:17.802487 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _do_read  hole 0x0~6c00
> > > 2017-01-30 03:57:17.802490 7f0036a20700 20 bluestore(bluestore.test_temp_dir) _do_read  blob Blob(0x55eb5047a460 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map(0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
> > > 2017-01-30 03:57:17.802495 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _do_read    will read 0x6c00: 0x3800~800
> > > 2017-01-30 03:57:17.802509 7f0036a20700 20 bluestore(bluestore.test_temp_dir) _do_read  blob Blob(0x55eb4dc87b80 blob([0x40194000~18000] csum+shared crc32c/0x1000) ref_map(0x0~12a00=1) SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded shared_blob(ref_map(0x40194000~18000=3)))) need 0x0~12a00 cache has 0x[0~12a00]
> > > 2017-01-30 03:57:17.802515 7f0036a20700 30 bluestore(bluestore.test_temp_dir) _do_read    use cache 0x7400: 0x0~12a00
> > > 2017-01-30 03:57:17.802519 7f0036a20700 20 bluestore(bluestore.test_temp_dir) _do_read  blob Blob(0x55eb5047a460 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
> > > 2017-01-30 03:57:17.802529 7f0036a20700 20 bluestore(bluestore.test_temp_dir) _do_read    region 0x6c00: 0x3800~800 reading 0x3000~1000
> > > 
> > > I haven't unwound all the clone_range transformations that led to this
> > > state yet. In the example above the source object already has the same
> > > unaligned extents issue.
> > > 
> > > But anyway it appears that clone_range neither cares about nor asserts on
> > > unaligned input offsets...
> > > 
> > > I can share a couple of logs if needed..
> > > 
> > > Any comments?
> > > 
> > > Thanks,
> > > 
> > > Igor
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> 