On Fri, 15 Sep 2023 19:17:51 +0100 Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > On Fri, Sep 15, 2023 at 07:54:50PM +0200, Claudio Imbrenda wrote: > > On Fri, 15 Sep 2023 18:28:25 +0100 > > "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx> wrote: > > > > > We introduced arch_make_folio_accessible() a couple of years > > > ago, and it's in use in the page writeback path. GUP still uses > > > arch_make_page_accessible(), which means that we can succeed in making > > > a single page of a folio accessible, then fail to make the rest of the > > > folio accessible when it comes time to do writeback and it's too late > > > to do anything about it. I'm not sure how much of a real problem this is. > > > > > > Switching everything around to arch_make_folio_accessible() also lets > > > us switch the page flag to be per-folio instead of per-page, which is > > > a good step towards dynamically allocated folios. > > > > if I understand correctly, this will as a matter of fact move the > > security property from pages to folios. > > Correct. > > > this means that trying to access a page will (try to) make the whole > > folio accessible, even though that might be counterproductive.... > > > > and there is no way to simply split a folio > > > > I don't like this > > As I said in the cover letter, we already make the entire folio > accessible in the writeback path. I suppose if you never write the > folio back, this is new ... > > Anyway, looking forward to a more substantial discussion on Monday. this will be a wall of text, sorry first some definitions: * secure page page belonging to a secure guest, accessible by the guest, not accessible by the host * shared page page belonging to a secure guest, accessible by the guest and by the host. the guest decides which pages to share * pinned shared the host can force a page to stay shared; when the guest wants to unshare it, a vm-exit event happens. this allows the host to make sure the page is allowed become secure again before allowing the page to be unshared * exported page with guest content, encrypted and integrity protected, no longer accessible by the guest, but accessible by the host * made accessible a page is pinned shared, or, if that fails, exported. now let's see how we use the arch-bit in struct page: the arch-bit that is used to track whether the page is secure or not. the bit MUST be set whenever a page is secure, and MAY be set for non-secure pages. in general it should not be set for non-secure pages, for performance reasons. sometimes we have small windows where non-secure pages can have the bit set. sometimes the arch-bit is used to determine whether further processing (e.g. export) is needed. when a page transitions from non-secure to secure, we must make sure that no I/O is possibly happening on it. we do this in 2 ways at the same time: * make sure the page is not in writeback, and if so wait until the writeback is over * make sure the page does not have any extra references except for the mapping itself. this means no-one else is trying to use the page for any other purpose, e.g. direct I/O. >> If the page has extra references, we wait until they are dropped << each time a guest tries to touch an exported page, it gets imported. it's important to track the security property of each page individually. and this is the most important thing, and the root of all our issues: >> we MUST NOT do any kind of I/O on secure pages << now let's see what happens for writeback. if a folio is in writeback, all of it is going to be written back. so all of it needs to be made accessible. once the I/O is over, the pages can be accessed again. let's see what happens for other types of I/O (e.g. direct I/O) currently, when we try to do direct I/O on a specific page, pin_user_pages() will cause the pages to be made accessible. the other neighbouring pages will stay secure and keep the arch bit set. if the security property tracking gets moved to a whole folio, then you can have a situation where the guest asks for I/O on a shared page, and then tries to access a neighbouring page... which will hang until the I/O is done, since the rest of folio has now been exported. even worse, virtio requires individual guest pages to be shared with the host. the host needs to do pin_user_pages() on those pages to make sure they stay there, since they will be accessed directly. the pin stays as long as the corresponding virtio devices are configured (i.e. until the guest shuts down or reboots). with your changes, the whole folio will be made accessible, both the shared page (pin shared) and the remaining pages in the folio (exported). when the guest tries to access the remaining pages, it will fail, triggering an import in the host. the host will then proceed to wait forever..... if I understand correctly, we have no control over the size of the folios, and no way to split folios, so there is no solution to this.