Re: [RFC] vhost-blk implementation

Christoph Hellwig wrote:
>> Inspired by the vhost-net implementation, I did an initial prototype of
>> vhost-blk to see if it provides any benefits over QEMU virtio-blk.
>> I haven't handled all the error cases or fixed naming conventions etc.,
>> but the implementation is stable enough to play with. I tried not to
>> deviate from the vhost-net implementation where possible.

> Can you also send the qemu side of it?

>> with vhost-blk:
>> ----------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 126.135 seconds, 665 MB/s
>>
>> real    2m6.137s
>> user    0m0.281s
>> sys     0m14.725s
>>
>> without vhost-blk: (virtio)
>> ---------------------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 275.466 seconds, 305 MB/s
>>
>> real    4m35.468s
>> user    0m0.373s
>> sys     0m48.074s

> Which caching mode is this?  I assume data=writeback, because otherwise
> you'd be doing synchronous I/O directly from the handler.

Yes, this is with the default (writeback) cache mode. As mentioned earlier,
readahead is helping here, and in most cases the data is already in the page cache.

>> +static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
>> +			struct iovec *iov, int in)
>> +{
>> +	/* virtio-blk sectors are 512 bytes */
>> +	loff_t pos = sector << 9;
>> +	int ret = 0;
>> +
>> +	if (type & VIRTIO_BLK_T_FLUSH) {
>> +		ret = vfs_fsync(file, file->f_path.dentry, 1);
>> +	} else if (type & VIRTIO_BLK_T_OUT) {
>> +		ret = vfs_writev(file, iov, in, &pos);
>> +	} else {
>> +		ret = vfs_readv(file, iov, in, &pos);
>> +	}
>> +	return ret;
>> +}

> I have to admit I don't understand the vhost architecture at all, but
> where do the actual data pointers used by the iovecs reside?
> vfs_readv/writev expect both the iovec itself and the buffers
> pointed to by it to reside in userspace, so just using kernel buffers
> here will break badly on architectures with different user/kernel
> mappings.  A lot of this is fixable using simple set_fs & co tricks,
> but for direct I/O which uses get_user_pages even that will fail badly.

The iovecs and buffers are user-space pointers (from the host kernel's point
of view); they are guest addresses, i.e. addresses in QEMU's address space.
So I don't need to do any set_fs tricks.
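
To illustrate the flow: the rough sketch below is built on the vhost-net
helpers (vhost_get_vq_desc(), vhost_add_used_and_signal()) and is not the
actual patch; the struct vhost_blk layout and the handle_blk() name are made
up here for illustration.

#include <linux/virtio_blk.h>
#include <linux/uaccess.h>
#include "vhost.h"                      /* private drivers/vhost header */

struct vhost_blk {                      /* assumed per-device state */
        struct vhost_dev dev;
        struct vhost_virtqueue vq;
        struct file *file;              /* backing raw image / block device */
};

static void handle_blk(struct vhost_blk *vblk)
{
        struct vhost_virtqueue *vq = &vblk->vq;
        struct virtio_blk_outhdr hdr;
        unsigned int out, in, head;
        int ret;

        for (;;) {
                head = vhost_get_vq_desc(&vblk->dev, vq, vq->iov,
                                         ARRAY_SIZE(vq->iov),
                                         &out, &in, NULL, NULL);
                if (head == vq->num)    /* ring is empty */
                        break;

                /* First descriptor is the request header; iov_base is a
                 * guest address, valid in QEMU's (user) address space. */
                if (copy_from_user(&hdr, vq->iov[0].iov_base, sizeof(hdr)))
                        break;

                /* Data descriptors sit between the header and the status
                 * byte; they are handed to vfs_readv/writev unchanged. */
                ret = do_handle_io(vblk->file, hdr.type, hdr.sector,
                                   &vq->iov[1], out + in - 2);

                /* Status byte update omitted here; report completion. */
                vhost_add_used_and_signal(&vblk->dev, vq, head, ret);
        }
}

The descriptor chain always starts with the virtio_blk_outhdr and ends with
the status byte, so only the middle iovecs are passed to do_handle_io();
writing the status byte back to the guest is omitted in the sketch.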

> Also it seems like you're doing all the I/O synchronous here?  For
> data=writeback operations that could explain the read speedup
> as you're avoiding context switches, but for actual write I/O
> which has to get data to disk (either directly from vfs_writev or
> later through vfs_fsync) this seems like a really bad idea stealing
> a lot of guest time that should happen in the background.

Yes. QEMU virtio-blk batches up all the writes and hands the work off to
another thread, and when the writes complete it sends the completion status
back. Since I am doing everything synchronously (even though it is only a
write to the page cache), one request at a time, that explains the slowdown.
We need to find a way to:

1) batch the I/O writes together
2) hand them off to another thread to do the I/O, so that the vhost thread
   can handle the next set of requests
3) update the status on completion

What should I do here? I could create a bunch of kernel threads to do the I/O
for me, or somehow fit in and reuse the AIO io_submit() mechanism. What's the
best way? I hate to duplicate all the code the VFS is doing.
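
For example, option (2) could look something like the sketch below: a
per-device workqueue does the vfs_readv/writev and the used-ring update
outside the vhost thread, reusing the vhost_blk structure sketched earlier.
The vhost_blk_req structure and function names are made up for illustration,
and a real version would need its own copy of the iovecs, batching, and
proper error handling.

#include <linux/workqueue.h>
#include <linux/mmu_context.h>  /* use_mm/unuse_mm, as the vhost-net handlers do */
#include <linux/slab.h>

/* Hypothetical per-request context. */
struct vhost_blk_req {
        struct work_struct work;
        struct vhost_blk *vblk;
        unsigned int head;              /* descriptor chain to complete */
        uint32_t type;
        uint64_t sector;
        struct iovec *iov;              /* must be a private copy in real code */
        int nr_iov;
};

static struct workqueue_struct *vblk_wq; /* create_singlethread_workqueue() at init */

static void vhost_blk_req_work(struct work_struct *work)
{
        struct vhost_blk_req *req =
                container_of(work, struct vhost_blk_req, work);
        int ret;

        /* Guest buffers live in QEMU's address space, so borrow its mm
         * for the duration of the I/O, like the vhost-net handlers do. */
        use_mm(req->vblk->dev.mm);
        ret = do_handle_io(req->vblk->file, req->type, req->sector,
                           req->iov, req->nr_iov);
        unuse_mm(req->vblk->dev.mm);

        /* Step 3: update the used ring and notify the guest. */
        vhost_add_used_and_signal(&req->vblk->dev, &req->vblk->vq,
                                  req->head, ret);
        kfree(req);
}

/* Called from the vhost thread instead of doing the I/O inline (step 2). */
static int vhost_blk_queue_req(struct vhost_blk *vblk, unsigned int head,
                               uint32_t type, uint64_t sector,
                               struct iovec *iov, int nr_iov)
{
        struct vhost_blk_req *req = kmalloc(sizeof(*req), GFP_KERNEL);

        if (!req)
                return -ENOMEM;

        req->vblk = vblk;
        req->head = head;
        req->type = type;
        req->sector = sector;
        req->iov = iov;
        req->nr_iov = nr_iov;
        INIT_WORK(&req->work, vhost_blk_req_work);

        queue_work(vblk_wq, &req->work);
        return 0;
}

This still doesn't batch requests (point 1); the vhost thread would need to
collect several descriptors before kicking the workqueue, or the whole thing
could instead be recast on top of the in-kernel AIO path.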

> Other than that the code seems quite nice and simple, but one huge
> problem is that it'll only support raw images, and thus misses out
> on all the "nice" image formats used in qemu deployments, especially
> qcow2.  It's also missing the ioctl magic we have in various
> places, both for controlling host devices like cdroms and SG
> passthrough.

True... unfortunately, I don't understand those (qcow2) details well enough
yet to even comment. I need to read up on them. :(

Thanks,
Badari

