On Wed 17-04-19 18:28:58, Jerome Glisse wrote: > On Wed, Apr 17, 2019 at 02:53:28PM -0700, Dan Williams wrote: > > On Tue, Apr 16, 2019 at 12:50 PM Jerome Glisse <jglisse@xxxxxxxxxx> wrote: > > > > > > On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote: > > > > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet > > > > <kent.overstreet@xxxxxxxxx> wrote: > > > > > > > > > > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote: > > > > > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@xxxxxxxxxx wrote: > > > > > > > From: Jérôme Glisse <jglisse@xxxxxxxxxx> > > > > > > > > > > > > > > This patchset depends on various small fixes [1] and also on patchset > > > > > > > which introduce put_user_page*() [2] and thus is 5.3 material as those > > > > > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now > > > > > > > so that it can get review and comments on how and what should be done > > > > > > > to test things. > > > > > > > > > > > > > > For various reasons [2] [3] we want to track page reference through GUP > > > > > > > differently than "regular" page reference. Thus we need to keep track > > > > > > > of how we got a page within the block and fs layer. To do so this patch- > > > > > > > set change the bio_bvec struct to store a pfn and flags instead of a > > > > > > > direct pointer to a page. This way we can flag page that are coming from > > > > > > > GUP. > > > > > > > > > > > > > > This patchset is divided as follow: > > > > > > > - First part of the patchset is just small cleanup i believe they > > > > > > > can go in as his assuming people are ok with them. > > > > > > > > > > > > > > > > > > > - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is > > > > > > > done in multi-step, first we replace all direct dereference of > > > > > > > the field by call to inline helper, then we introduce macro for > > > > > > > bio_bvec that are initialized on the stack. Finaly we change the > > > > > > > bv_page field to bv_pfn. > > > > > > > > > > > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr > > > > > > as a flag (pointer always aligned to 64 bytes in our case). > > > > > > > > > > > > So yes we need an inline helper for reference of the page but is it not clearer > > > > > > that we assume a page* and not any kind of pfn ? > > > > > > It will not be the first place using low bits of a pointer for flags. > > > > > > > > > > > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist > > > > > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she > > > > > > can just submit them as two separate BIOs (chained at the block layer). > > > > > > > > > > > > Many users just submit one page bios and let elevator merge them any way. > > > > > > > > > > Let's please not add additional flags and weirdness to struct bio - "if this > > > > > flag is set interpret one way, if not interpret another" - or eventually bios > > > > > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn. > > > > > > > > This all reminds of the failed attempt to teach the block layer to > > > > operate without pages: > > > > > > > > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > > > > > > > > > > > > > > Question though - why do we need a flag for whether a page is a GUP page or not? > > > > > Couldn't the needed information just be determined by what range the pfn is not > > > > > (i.e. whether or not it has a struct page associated with it)? > > > > > > > > That amounts to a pfn_valid() check which is a bit heavier than if we > > > > can store a flag in the bv_pfn entry directly. > > > > > > > > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than > > > > an 'unsigned long'. > > > > > > > > That said, I'm still in favor of Jan's proposal to just make the > > > > bv_page semantics uniform. Otherwise we're complicating this core > > > > infrastructure for some yet to be implemented GPU memory management > > > > capabilities with yet to be determined value. Circle back when that > > > > value is clear, but in the meantime fix the GUP bug. > > > > > > This has nothing to do with GPU, what make you think so ? Here i am > > > trying to solve GUP and to keep the value of knowing wether a page > > > has been GUP or not. I argue that if we bias every page in every bio > > > then we loose that information and thus the value. > > > > > > I gave the page protection mechanisms as an example that would be > > > impacted but it is not the only one. Knowing if a page has been GUP > > > can be useful for memory reclaimation, compaction, NUMA balancing, > > > > Right, this is what I was reacting to in your pushback to Jan's > > proposal. You're claiming value for not doing the simple thing for > > some future "may be useful in these contexts". To my knowledge those > > things are not broken today. You're asking for the complexity to be > > carried today for some future benefit, and I'm asking for the > > simplicity to be maintained as much as possible today and let the > > value of future changes stand on their own to push for more complexity > > later. > > > > Effectively don't use this bug fix to push complexity for a future > > agenda where the value has yet to be quantified. > > Except that this solution (biasing everyone in bio) would _more complex_ > it is only conceptualy appealing. The changes are on the other hand much > deeper and much riskier but you decided to ignore that and focus on some- > thing i was just giving as an example. Yeah, after going and reading several places like fs/iomap.c, fs/mpage.c, drivers/md/dm-io.c I agree with you. The places that are not doing direct IO usually just don't hold any page reference that could be directly attributed to the bio (and they don't drop it when bio finishes). They rather use other means (like PageLocked, PageWriteback) to make sure the page stays alive so mandating gup-pin reference for all pages attached to a bio would require a lot of reworking of places that are not related to our problem and currently work just fine. So I withdraw my suggestion. Nice in theory, too much work in practice ;). Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR