On Wed, Nov 16, 2022 at 05:00:08PM -0800, James Houghton wrote: > On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote: > > > +struct hugetlb_pte { > > > + pte_t *ptep; > > > + unsigned int shift; > > > + enum hugetlb_level level; > > > + spinlock_t *ptl; > > > +}; > > > > Do we need both shift + level? Maybe it's only meaningful for ARM where > > the shift may not be directly calculcated from level? > > > > I'm wondering whether we can just maintain "shift" then we calculate > > "level" realtime. It just reads a bit weird to have these two fields, also > > a burden to most of the call sites where shift and level exactly match.. > > My main concern is interaction with folded levels. For example, if > PUD_SIZE and PMD_SIZE are the same, we want to do something like this: > > pud = pud_offset(p4d, addr) > pmd = pmd_offset(pud, addr) /* this is just pmd = (pmd_t *) pud */ > pte = pte_offset(pmd, addr) > > and I think we should avoid quietly skipping the folded level, which > could happen: > > pud = pud_offset(p4d, addr) > /* Each time, we go back to pte_t *, so if we stored PUD_SHIFT here, > it is impossible to know that `pud` came from `pud_offset` and not > `pmd_offset`. We must assume the deeper level so that we don't get > stuck in a loop. */ > pte = pte_offset(pud, addr) /* pud is cast from (pud_t * -> pte_t * -> > pmd_t *) */ > > Quietly dropping p*d_offset for folded levels is safe; it's just a > cast that we're doing anyway. If you think this is fine, then I can > remove `level`. It might also be that this is a non-issue and that > there will never be a folded level underneath a hugepage level. > > We could also change `ptep` to a union eventually (to clean up > "hugetlb casts everything to pte_t *" messiness), and having an > explicit `level` as a tag for the union would be nice help. In the > same way: I like having `level` explicitly so that we know for sure > where `ptep` came from. Makes sense. > > I can try to reduce the burden at the callsite while keeping `level`: > hpage_size_to_level() is really annoying to have everywhere. Yeah this would be nice. Per what I see most callers are having paired level/shift, so maybe we can make hugetlb_hgm_iter_init() to only take one of them and calculate the other. Then there can also be the __hugetlb_hgm_iter_init() which takes both, only used in the few places where necessary to have explicit level/shift. hugetlb_hgm_iter_init() calls __hugetlb_hgm_iter_init(). Thanks, -- Peter Xu