Re: Slow RBD copy

On Sat, Mar 30, 2013 at 3:46 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> On 03/29/2013 01:42 AM, Steve Carter wrote:
>>
>> I create an empty 150G volume, then copy it to a second pool:
>>
>> # rbd -p pool0 create --size 153750 steve150
>>
>> # /usr/bin/time rbd cp pool0/steve150 pool1/steve150
>> Image copy: 100% complete...done.
>> 303.44user 233.40system 1:52:10elapsed 7%CPU (0avgtext+0avgdata
>> 248832maxresident)k
>>
>> Notice there is no data in the steve150 volume.
>>
>> I then repeat with a 100G volume that is full, filled using dd:
>>
>> # /usr/bin/time rbd cp pool0/steve100 pool1/steve100
>> Image copy: 100% complete...done.
>> 338.81user 573.55system 2:13:05elapsed 11%CPU (0avgtext+0avgdata
>> 201712maxresident)k
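>>
>> (For reference, a sketch of one way to fill an image like this
>> through the kernel RBD client; the device path assumes the standard
>> udev rules, and the exact invocation may have differed:
>>
>>   # rbd map pool0/steve100
>>   # dd if=/dev/zero of=/dev/rbd/pool0/steve100 bs=4M oflag=direct
>>   # rbd unmap /dev/rbd/pool0/steve100
>>
>> Writing through the mapped device forces every backing object to be
>> created; dd exits with "No space left on device" once the image is
>> full, which is expected here.)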
>>
>> I see threads hitting files in the mon data directory quite a bit,
>> and about 35% iowait the whole time the copy is running.
>>
>> I then repeat the above measurements using a tmpfs/ramdisk for the
>> mon data. The improvement in time is about 10%, which indicates the
>> spinning disks backing the mon data account for roughly 10% of the
>> copy latency. Replacing those disks with SSDs would likely improve
>> copy performance accordingly.
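>>
>> (For reproducibility, a minimal sketch of the tmpfs experiment; the
>> monitor name "a", the default /var/lib/ceph/mon path, and the init
>> commands are assumptions that vary by setup:
>>
>>   # service ceph stop mon.a
>>   # cp -a /var/lib/ceph/mon/ceph-a /root/mon-backup
>>   # mount -t tmpfs -o size=2G tmpfs /var/lib/ceph/mon/ceph-a
>>   # cp -a /root/mon-backup/. /var/lib/ceph/mon/ceph-a/
>>   # service ceph start mon.a
>>
>> The tmpfs contents disappear on reboot, so this is only for testing.)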
>>
>> What I don't understand is why the first copy, of the empty volume,
>> takes so long. If I am correct that only a small number of objects
>> are associated with this volume, what is happening during the copy
>> process?
>>
>
> RBD images are sparse, meaning no object is created until data is
> actually written to it, so I understand why you would expect an empty
> volume to be copied almost instantly.
>
> However, RBD works by calculating which object corresponds to which
> part of the image; there is no list recording which objects exist and
> which don't.
>
> So during the copy, every object that could possibly exist is probed,
> and those that do exist are copied (sketched below).
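>
> A rough sketch of what that probing amounts to, expressed with the
> rados CLI (the block-name prefix rb.0.1234 is a placeholder; the real
> prefix is shown by "rbd info", and 4 MB objects are the default):
>
>   prefix=rb.0.1234              # from "rbd info pool0/steve150"
>   count=$(( 153750 / 4 ))       # 150G image / 4 MB per object
>   for i in $(seq 0 $(( count - 1 ))); do
>       name=$(printf '%s.%012x' "$prefix" "$i")
>       rados -p pool0 stat "$name" >/dev/null 2>&1 && echo "$name exists"
>   done
>
> For an empty image nearly all of those ~38,000 stat calls come back
> ENOENT, but each one is still a round trip to an OSD, which is where
> the time goes.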
>
> During your test with a volume written to by dd, actual data has to
> be copied as well, which is why it takes somewhat longer.
>
> The monitor I/O load you are seeing is due to the map updates which
> happen with each write; these take time because the monitor calls
> sync() many times to be sure its data is consistent.

That's not quite correct — the monitor maintains data which is updated
constantly by all the OSDs. They update it a little more often when
there's activity in the cluster, but it's not "with each write". That
would be quite the scaling bottleneck. ;)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




