On 5/18/21 8:56 PM, Jane Chu wrote: > On 5/18/2021 10:27 AM, Joao Martins wrote: > >> On 5/5/21 11:36 PM, Joao Martins wrote: >>> On 5/5/21 11:20 PM, Dan Williams wrote: >>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote: >>>>> On 5/5/21 7:44 PM, Dan Williams wrote: >>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote: >>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>>>> index b46f63dcaed3..bb28d82dda5e 100644 >>>>>>> --- a/include/linux/memremap.h >>>>>>> +++ b/include/linux/memremap.h >>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap { >>>>>>> struct completion done; >>>>>>> enum memory_type type; >>>>>>> unsigned int flags; >>>>>>> + unsigned long align; >>>>>> I think this wants some kernel-doc above to indicate that non-zero >>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE >>>>>> means "use non-compound base pages". >> [...] >> >>>>>> The non-zero value must be >>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. >>>>>> Hmm, maybe it should be an >>>>>> enum: >>>>>> >>>>>> enum devmap_geometry { >>>>>> DEVMAP_PTE, >>>>>> DEVMAP_PMD, >>>>>> DEVMAP_PUD, >>>>>> } >>>>>> >>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe >>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)? >>>> I think it is ok for dax/nvdimm to continue to maintain their align >>>> value because it should be ok to have 4MB align if the device really >>>> wanted. However, when it goes to map that alignment with >>>> memremap_pages() it can pick a mode. For example, it's already the >>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so >>>> they're already separate concepts that can stay separate. >>>> >>> Gotcha. >> I am reconsidering part of the above. In general, yes, the meaning of devmap @align >> represents a slightly different variation of the device @align i.e. how the metadata is >> laid out **but** regardless of what kind of page table entries we use vmemmap. >> >> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already >> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE) >> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace >> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred >> above? -- and 3) the value of geometry actually derives from dax device @align because we >> will need to create compound pages representing a page size of @align value. >> >> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs, >> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs >> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what >> device @align. In reality what we want to convey in @geometry is not page table sizes, but >> just the page size used for the vmemmap of the dax device. Additionally, limiting its >> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm >> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to >> create compound pages of order 10 for the said vmemmap. >> >> I am going to wait until you finish reviewing the remaining four patches of this series, >> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without >> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a >> setter/getter to audit its value? Thoughts? > > Good points there. > > My understanding is that dax->align conveys granularity of size while > carving out a namespace it's a geometry attribute loosely akin to sector size of a spindle > disk. I tend to think that device pagesize has almost no relation to "align" in that, it's > possible to have 1G "align" and 4K pagesize, or verse versa. That is, with the advent of compound page > support, it is possible to totally separate the two concepts. > > How about adding a new option to "ndctl create-namespace" that describes > device creator's desired pagesize, and another parameter to describe whether the pagesize shall > be fixed or allowed to be split up, such that, if the intention is to never split up 2M pagesize, then it > would be possible to save a lot metadata space on the device? Maybe that can be selected by the driver too, but it's an interesting point you raise should we settle with the geometry (e.g. like a geometry sysfs entry IIUC your suggestion?). device-dax for example would use geometry == align and therefore save space (like what I propose in patch 10). But fsdax would retain the default that is geometry = PAGE_SIZE and align = PMD_SIZE should it want to split pages. Interestingly, devmap poisoning always occur at @align level regardless of @geometry. What I am not sure is what value (vs added complexity) it brings to allow geometry *value* to be selecteable by user given that so far we seem to only ever initialize metadata as either sets of base pages [*] or sets of compound pages (of a size). And the difference between both can possibly be summarized to split-ability like you say. [*] that optionally can are morphed into compound pages by driver