On Wed, 14 Jun 2023 14:59:33 -0700 (PDT) Hugh Dickins <hughd@xxxxxxxxxx> wrote: [...] > > > > It would be much more acceptable to simply not add back such fragments > > to the list, and therefore risking some memory waste, than risking to > > use an unstable mm in __tlb_remove_table(). The amount of wasted memory > > in practice might also not be a lot, depending on whether the fragments > > belong to the same and contiguous mapping. > > I agree that it's much better to waste a little memory (and only temporarily) > than to freeze or corrupt. But it's not an insoluble problem, I just didn't > want to get into more change if there was already an answer that covers it. > > I assume that the freed mm issue scared you away from testing my patch, > so we don't know whether I got those mask conditionals right or not? Correct, that scared me a lot :-). On the one hand, I do not feel familiar enough with the common code logic that might need to be changed, or at least understood, in order to judge if this is a problem and how it could be addressed. On the other hand, I am scared of subtle bugs that would not show immediately, and hit us by surprise later. Your thoughts about using RCU to free mm, in order to address this "unstable mm" in __tlb_remove_table(), sound like a reasonable approach. But again, with my lack of understanding, I am not sure if I can cope with that. So at least be prepared for call backs on that issue, not by RCU but by mail :-) > > > > > Also, we would not need to use page->pt_mm, and therefore make room for > > page->pt_frag_refcount, which for some reason is (still) being used > > in new v4 from Vishals "Split ptdesc from struct page" series... > > Vishal's ptdesc: I've been ignoring as far as possible, I'll have to > respond on that later today, I'm afraid it will be putting this all into > an intolerable straitjacket. If ptdesc is actually making more space > available by some magic, great: but I don't expect to find that is so. > Anyway, for now, there it's impossible (for me anyway) to think of that > at the same time as this. I can totally relate to that. And I also had the feeling and hope that ptdesc would give some relief on complex struct page (mis-)use, but did not yet get into investigating further. [...] > > I dot not fully understand if / why we need the new HH bits. While working > > on my patch it seemed to be useful for sorting out list_add/del in the > > various cases. Here it only seems to be used for preventing double rcu_head > > usage, is this correct, or am I missing something? > > Correct, I only needed the HH bits for avoiding double rcu_head usage (then > implementing the second usage one the first has completed). If you want to > distinguish pte_free_defer() tables from page_table_free_rcu() tables, the > HH bits would need to be considered in other places too, I imagine: that > gets more complicated, I fear. Yes, I have the same impression. My approach would prevent scary "unstable mm" issues in __tlb_remove_table(), but probably introduce other subtle issue. Or not so subtle, like potential double list_free(), as mentioned in my last reply. So it seems we have no completely safe approach so far, but I would agree on going further with your approach for now. See below for patch comments. [...] > > I'm getting this reply back to you, before reviewing your patch below. > Probably the only way I can review yours is to try it myself and compare. > I'll look into it, once I understand c2c224932fd0. But may have to write > to Vishal first, or get the v2 of my series out: if only I could work out > a safe and easy way of unbreaking s390... > > Is there any chance of you trying mine? > But please don't let it waste your time. I have put it to some LTP tests now, and good news is that it does not show any obvious issues. Only some deadlocks on mm->context.lock, but that can easily be solved. Problem is that we have some users of that lock, who do spin_lock() and not spin_lock_bh(). In the past, we had 3 different locks in mm->context, and then combined them to use the same. Only the pagetable list locks were taken with spin_lock_bh(), the others used spin_lock(). Of course, after combining them to use the same lock, it would have been required to change the others to also use spin_lock_bh(), at least if there was any good reason for using _bh in the pagetable list lock. It seems there was not, which is why that mismatch was not causing any issues so far, probably we had some reason which got removed in one of the various reworks of that code... With your patch, we do now have a reason, because __tlb_remove_table() will usually be called in _bh context as RCU callback, and now also takes that lock. So we also need this change (and two compile fixes, marked below): --- a/arch/s390/include/asm/tlbflush.h +++ b/arch/s390/include/asm/tlbflush.h @@ -79,12 +79,12 @@ static inline void __tlb_flush_kernel(vo static inline void __tlb_flush_mm_lazy(struct mm_struct * mm) { - spin_lock(&mm->context.lock); + spin_lock_bh(&mm->context.lock); if (mm->context.flush_mm) { mm->context.flush_mm = 0; __tlb_flush_mm(mm); } - spin_unlock(&mm->context.lock); + spin_unlock_bh(&mm->context.lock); } /* --- a/arch/s390/mm/gmap.c +++ b/arch/s390/mm/gmap.c @@ -102,14 +102,14 @@ struct gmap *gmap_create(struct mm_struc if (!gmap) return NULL; gmap->mm = mm; - spin_lock(&mm->context.lock); + spin_lock_bh(&mm->context.lock); list_add_rcu(&gmap->list, &mm->context.gmap_list); if (list_is_singular(&mm->context.gmap_list)) gmap_asce = gmap->asce; else gmap_asce = -1UL; WRITE_ONCE(mm->context.gmap_asce, gmap_asce); - spin_unlock(&mm->context.lock); + spin_unlock_bh(&mm->context.lock); return gmap; } EXPORT_SYMBOL_GPL(gmap_create); @@ -250,7 +250,7 @@ void gmap_remove(struct gmap *gmap) spin_unlock(&gmap->shadow_lock); } /* Remove gmap from the pre-mm list */ - spin_lock(&gmap->mm->context.lock); + spin_lock_bh(&gmap->mm->context.lock); list_del_rcu(&gmap->list); if (list_empty(&gmap->mm->context.gmap_list)) gmap_asce = 0; @@ -260,7 +260,7 @@ void gmap_remove(struct gmap *gmap) else gmap_asce = -1UL; WRITE_ONCE(gmap->mm->context.gmap_asce, gmap_asce); - spin_unlock(&gmap->mm->context.lock); + spin_unlock_bh(&gmap->mm->context.lock); synchronize_rcu(); /* Put reference */ gmap_put(gmap); These are the compile fixes: > +static void pte_free_now1(struct rcu_read *head) rcu_read -> rcu_head > +{ > + pte_free_half(head, 1); > +} > + > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) > +{ > + unsigned int bit, mask; > + struct page *page; > + > + page = virt_to_page(pgtable); > + WARN_ON_ONCE(page->pt_mm != mm); > + if (mm_alloc_pgste(mm)) { > + call_rcu(&page->rcu_head, pte_free_pgste) Missing ";" at the end > + return; > + }