Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

Avi Kivity <avi@xxxxxxxxxx> · Tue, 16 Mar 2010 13:08:28 +0200

On 03/16/2010 12:44 PM, Christoph Hellwig wrote:
On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:

Are you talking about direct volume access or qcow2?

Doesn't matter.

For direct volume access, I still don't get it.  The number of barriers
issued by the host must equal (or exceed, but that's pointless) the
number of barriers issued by the guest.  cache=writeback allows the host
to reorder writes, but so does cache=none.  Where does the difference
come from?

Put it another way.  In an unvirtualized environment, if you implement a
write cache in a storage driver (not device), and sync it on a barrier
request, would you expect to see a performance improvement?

cache=none only allows very limited reordering in the host.  O_DIRECT
is synchronous on the host, so there's just some very limited reordering
going on in the elevator if we have other I/O going on in parallel.

Presumably there is lots of I/O going on, or we wouldn't be having this 
conversation.

In addition to that the disk write cache can perform limited reordering
and caching, but the disk cache has a rather limited size.  The host
pagecache gives a much wider opportunity to reorder, especially if
the guest workload is not cache flush heavy.  If the guest workload
is extremely cache flush heavy the usefulness of the pagecache is rather
limited, as we'll only use very little of it, but pay by having to do
a data copy.  If the workload is not cache flush heavy, and we have
multiple guests doing I/O to the same spindles it will allow the host
to do much more efficient data writeout by being able to do better
ordered (less seeky) and bigger I/O (especially if the host has real
storage compared to IDE for the guest).

Let's assume the guest has virtio (I agree that with IDE we need reordering 
on the host).  The guest sends batches of I/O separated by cache 
flushes.  If the batches are smaller than the virtio queue length, 
ideally things look like:

  io_submit(..., batch_size_1);
  io_getevents(..., batch_size_1);
  fdatasync();
  io_submit(..., batch_size_2);
  io_getevents(..., batch_size_2);
  fdatasync();
  io_submit(..., batch_size_3);
  io_getevents(..., batch_size_3);
  fdatasync();

(certainly that won't happen today, but it could in principle).

How does a write cache give any advantage?  The host kernel sees 
_exactly_ the same information as it would from a bunch of threaded 
pwritev()s followed by fdatasync().
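
For concreteness, here is a minimal sketch of the "pwritev()s followed by
fdatasync()" pattern for a single batch. It is an illustration only, not
QEMU or kernel code: struct guest_write, write_batch_then_flush() and
open_scratch_file() are hypothetical names, the writes are issued
sequentially rather than from threads to keep the sketch short, and short
writes are simply treated as errors.

```c
#define _GNU_SOURCE            /* ask glibc to expose pwritev() and mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical descriptor for one guest write; not a real QEMU/kernel type. */
struct guest_write {
    off_t offset;      /* byte offset in the image file or volume */
    const void *buf;   /* data to write */
    size_t len;        /* number of bytes in buf */
};

/*
 * Issue every write in one batch through the page cache, then flush once.
 * Returns 0 on success and -1 on any failed or short write.
 */
int write_batch_then_flush(int fd, const struct guest_write *reqs, size_t n)
{
    assert(fd >= 0);
    for (size_t i = 0; i < n; i++) {
        struct iovec iov = {
            .iov_base = (void *)reqs[i].buf,
            .iov_len = reqs[i].len,
        };
        ssize_t done = pwritev(fd, &iov, 1, reqs[i].offset);
        if (done < 0 || (size_t)done != reqs[i].len)
            return -1;  /* this sketch does not retry short writes */
    }
    /* The guest's cache flush maps to one fdatasync() per batch. */
    return fdatasync(fd);
}

/* Hypothetical helper for trying the sketch: an unlinked scratch file. */
int open_scratch_file(void)
{
    char path[] = "/tmp/batch-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd >= 0)
        unlink(path);   /* file disappears once fd is closed */
    return fd;
}
```

Calling write_batch_then_flush() once per guest batch hands the host
kernel the same writes and the same flush points as the io_submit() /
io_getevents() / fdatasync() sequence above, only through a buffered fd.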

(wish: IO_CMD_ORDERED_FDATASYNC)

If the batch size is larger than the virtio queue size, or if there are 
no flushes at all, then yes the huge write cache gives more opportunity 
for reordering.  But we're already talking hundreds of requests here.

Let's say the virtio queue size was unlimited.  What merging/reordering 
opportunity are we missing on the host?  Again we have exactly the same 
information: either the pagecache lru + radix tree that identifies all 
dirty pages in disk order, or the block queue with pending requests that 
contains exactly the same information.
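
To illustrate what "the same information" buys, here is a rough sketch of
sorting a set of pending writes into disk order and merging contiguous
ones. It is not the kernel's elevator or writeback code: struct
pending_io and sort_and_merge() are hypothetical names, and only requests
that are exactly back to back are merged.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pending write: byte offset and length on disk. */
struct pending_io {
    unsigned long long offset;
    unsigned long long len;
};

/* qsort() comparator: ascending by offset, without overflow. */
static int by_offset(const void *a, const void *b)
{
    const struct pending_io *x = a;
    const struct pending_io *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort the queue in disk order and merge requests that are exactly
 * contiguous. The array is compacted in place; returns the new count.
 */
size_t sort_and_merge(struct pending_io *q, size_t n)
{
    if (n == 0)
        return 0;

    qsort(q, n, sizeof(*q), by_offset);

    size_t out = 0;  /* index of the request currently being extended */
    for (size_t i = 1; i < n; i++) {
        if (q[i].offset == q[out].offset + q[out].len)
            q[out].len += q[i].len;   /* back-to-back: extend */
        else
            q[++out] = q[i];          /* gap or overlap: keep separate */
    }
    return out + 1;
}
```

The input here is just a set of (offset, length) pairs, which can be
derived equally well from a list of dirty pages or from a queue of
pending block requests.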

Something is wrong.  Maybe it's my understanding, but on the other hand 
it may be a piece of kernel code.

--
error compiling committee.c: too many arguments to function
