On Sat, Mar 30, 2013 at 3:46 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> On 03/29/2013 01:42 AM, Steve Carter wrote:
>>
>> I create an empty 150G volume, then copy it to a second pool:
>>
>> # rbd -p pool0 create --size 153750 steve150
>>
>> # /usr/bin/time rbd cp pool0/steve150 pool1/steve150
>> Image copy: 100% complete...done.
>> 303.44user 233.40system 1:52:10elapsed 7%CPU (0avgtext+0avgdata
>> 248832maxresident)k
>>
>> Notice there is no data in the steve150 volume.
>>
>> I then repeat with a 100G volume that is full, filled using dd:
>>
>> # /usr/bin/time rbd cp pool0/steve100 pool1/steve100
>> Image copy: 100% complete...done.
>> 338.81user 573.55system 2:13:05elapsed 11%CPU (0avgtext+0avgdata
>> 201712maxresident)k
>>
>> I see threads hitting files in the mon data quite a bit, and 35% IOwait
>> for the whole time the copy is running.
>>
>> I then repeat the above measurements using a tmpfs/ramdisk for the mon
>> data. The improvement in time is about 10%, which indicates my spinning
>> disks are responsible for 10% of the copy latency. Replacing the
>> spinning disks with SSDs would likely improve the copy performance.
>>
>> What I don't understand is why the first copy, of the empty volume, takes
>> so long. If I am correct that there are only a small number of objects
>> associated with this volume, I wonder what is happening during the copy
>> process?
>>
>
> RBD images are sparse, which means that no object is created until data is
> actually written to it, so I get your reasoning as to why an empty volume
> should be copied instantly.
>
> However, RBD works by calculating which object corresponds with which part
> of the image; there is no list which tells which objects exist and which
> don't.
>
> So during the copy, every object which could possibly exist is probed to
> see whether it exists, and then copied.
>
> During your test with a volume written to by dd, you have to copy actual
> data as well, so that's why it takes some more time.
>
> The monitor I/O load you are seeing is due to the map updates which are
> happening with each write, which takes time since the monitor calls sync()
> a lot of times to be sure it is consistent.

That's not quite correct: the monitors maintain data which is updated
constantly by all the OSDs. They update it a little more often when there's
activity in the cluster, but it's not "with each write". That would be quite
the scaling bottleneck. ;)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
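
A quick way to see the sparseness Wido describes is to count the backing
data objects that actually exist for an image. This is only a rough sketch,
not from the original thread: '<block_name_prefix>' below is a placeholder
for the prefix that "rbd info" prints, and "rados ls" walks the entire pool,
so it can be slow on a large cluster.

# rbd -p pool0 info steve150 | grep block_name_prefix
# rados -p pool0 ls | grep -c '<block_name_prefix>'

The empty steve150 image should report few or no data objects, while the
dd-filled steve100 image should report roughly one object per 4 MB of
written data (the default RBD object size), which is why the full copy also
has real data to move.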