Just for completeness, this is what we think we have going on:
(1) First boot: running on the host, pygrub reads the (unmounted) device
to bootstrap the guest with image A (kernel + grub.conf).
(2) The guest updates the kernel/grub.conf using the weird Xen IO path
(bypassing the host page cache, creating BIOs directly in host
memory).
Note that at this point, the image from (1) may still be in the host's
page cache.
(3) Reboot of the guest: host pygrub reads through the page cache (stale
pages) when bootstrapping the guest, which occasionally boots into the
stale image.
This doesn't happen every time: if the stale pages are dropped before
(3), the guest boots the updated image.
Using O_DIRECT has the side effect of invalidating the page cache while
reading.
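
For illustration, here is a minimal Python sketch of an O_DIRECT read of
the kind described above. It is not pygrub's actual code: read_direct(),
round_up() and the 4096-byte block size are assumptions made for the
example. O_DIRECT needs an aligned buffer, offset and length, so the
sketch borrows page-aligned memory from an anonymous mmap.

```python
import mmap
import os


def round_up(n, block_size):
    """Round n up to the next multiple of block_size."""
    return -(-n // block_size) * block_size


def read_direct(path, length, offset=0, block_size=4096):
    """Read up to `length` bytes at `offset`, bypassing the page cache.

    block_size is an assumed alignment; the real requirement depends on
    the device and filesystem.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if offset % block_size:
        raise ValueError("offset must be a multiple of block_size")
    size = round_up(length, block_size)
    buf = mmap.mmap(-1, size)  # anonymous mapping: page-aligned memory
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.preadv(fd, [buf], offset)
        return bytes(buf[:min(got, length)])
    finally:
        os.close(fd)
        buf.close()
```

Some filesystems reject O_DIRECT at open or read time, so a caller
should be prepared for OSError.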
Dropping the VM caches just for the devices in question would fix this,
or (as Christoph mentioned) use kvm, which does this more sanely :-)
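
One possible way to drop the cache for a single device, sketched in
Python under the assumption that posix_fadvise(POSIX_FADV_DONTNEED) on
the device node is acceptable here. drop_cached_pages() is a
hypothetical helper, and the call is only advisory: the kernel may keep
pages that are dirty or in use.

```python
import os


def drop_cached_pages(path):
    """Ask the kernel to drop cached pages for one file or block device."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages are not dropped, so write them back first.
        os.fsync(fd)
        # offset=0, len=0 covers the whole file or device.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The host would call this on the guest's backing device (for example a
hypothetical /dev/vg0/guest-disk) before running pygrub in step (3).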
ric