A couple of thoughts.

First of all, one of the reasons why this probably hasn't been addressed for so long is that programs that really care about issues like this tend to use Direct I/O, and don't use the page cache at all. And perhaps this is an option open to qemu as well?

Secondly, one of the reasons why we mark the page clean is that we didn't want a failing disk to cause memory to be trapped with no way of releasing the pages. For example, if a user plugs in a USB thumbstick, writes to it, and then rudely yanks it out before all of the pages have been written back, it would be unfortunate if the dirty pages could only be released by rebooting the system.

So an approach that might work is for fsync() to keep the pages dirty --- but only while the file descriptor is open. This could either be the default behavior, or something that has to be specifically requested via fcntl(2). That way, as soon as the process exits (at which point it will be too late for it to do anything to save the contents of the file) we also release the memory. And if the process gets OOM killed, again, the right thing happens. But if the process wants to take emergency measures to write the file somewhere else, it knows that the pages won't get lost until the file gets closed.

(BTW, a process could guarantee this today without any kernel changes by mmap'ing the whole file and mlock'ing the pages that it had modified. That way, even if there is an I/O error and the fsync causes the pages to be marked clean, the pages wouldn't go away. However, this is really a hack, and it would probably be easier for the process to use Direct I/O instead.
:-)

Finally, if the kernel knows that an error might be one that could be resolved by the simple expedient of waiting (for example, if a fibre channel cable is temporarily unplugged so it can be rerouted, but the user might plug it back in a minute or two later, or a dm-thin device is full, but the system administrator might do something to fix it), in the ideal world, the kernel should deal with it without requiring any magic from userspace applications. There might be a helper system daemon that enacts policy (we've paged the sysadmin, so it's OK to keep the page dirty and retry the writebacks to the dm-thin volume after the helper daemon gives the all-clear), but we shouldn't require all user space applications to have magic, Linux-specific retry code.

Cheers,

- Ted
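
For illustration only, the mmap/mlock workaround mentioned in the BTW aside above might look roughly like the sketch below. The helper name update_pinned() and the rescue_fd parameter are made up for this example and are not from any existing code. It assumes the range being written lies entirely inside the existing file, and that RLIMIT_MEMLOCK is large enough to lock the touched pages. Whether the locked pages really survive a failed writeback is the kernel behavior described above and is not something this sketch verifies.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite len bytes at offset off through a shared
 * mapping of the whole file, pin the touched pages with mlock(), then flush.
 * If the flush fails, the pinned data is still addressable, so it is copied
 * to rescue_fd as an emergency measure (pass -1 to skip the rescue step).
 *
 * Returns 0 if the flush succeeded, 1 if the flush failed but the data was
 * saved to rescue_fd, and -1 on any other failure.
 */
int update_pinned(int fd, off_t off, const void *buf, size_t len, int rescue_fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    if (off < 0 || len == 0 || (off_t)(off + (off_t)len) > st.st_size) {
        errno = EINVAL;
        return -1;
    }

    /* Map the whole file, as suggested in the text. */
    char *map = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    /* Round the modified range down to a page boundary for mlock/msync. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = (size_t)off & ~(page - 1);
    size_t span = (size_t)off + len - start;

    int rc = -1;

    /* Modify the pages, then pin them so they cannot be dropped. */
    memcpy(map + off, buf, len);
    if (mlock(map + start, span) == 0) {
        if (msync(map + start, span, MS_SYNC) == 0 && fsync(fd) == 0) {
            rc = 0;
        } else if (rescue_fd >= 0 &&
                   write(rescue_fd, map + off, len) == (ssize_t)len &&
                   fsync(rescue_fd) == 0) {
            /* Flush failed, but the pinned copy was saved elsewhere. */
            rc = 1;
        }
        munlock(map + start, span);
    }

    munmap(map, (size_t)st.st_size);
    return rc;
}
```

A caller would open the data file read-write, open a second file on a different device as rescue_fd, and treat a return value of 1 as "the original device is in trouble, but the update was saved". As the text says, this is really a hack, and opening the file with O_DIRECT is probably the simpler route.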