On Wed, Nov 27, 2019 at 07:06:01PM -0800, Hugh Dickins wrote: > On Tue, 26 Nov 2019, Yang Shi wrote: > > On 11/25/19 11:33 AM, Yang Shi wrote: > > > On 11/25/19 10:33 AM, Kirill A. Shutemov wrote: > > > > On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote: > > > > > On 11/25/19 1:36 AM, Kirill A. Shutemov wrote: > > > > > > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote: > > > > > > > Currently when truncating shmem file, if the range is partial of > > > > > > > THP > > > > > > > (start or end is in the middle of THP), the pages actually will > > > > > > > just get > > > > > > > cleared rather than being freed unless the range cover the whole > > > > > > > THP. > > > > > > > Even though all the subpages are truncated (randomly or > > > > > > > sequentially), > > > > > > > the THP may still be kept in page cache. This might be fine for > > > > > > > some > > > > > > > usecases which prefer preserving THP. > > > > > > > > > > > > > > But, when doing balloon inflation in QEMU, QEMU actually does hole > > > > > > > punch > > > > > > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not > > > > > > > used. > > > > > > > So, when using shmem THP as memory backend QEMU inflation actually > > > > > > > doesn't > > > > > > > work as expected since it doesn't free memory. But, the inflation > > > > > > > usecase really needs get the memory freed. Anonymous THP will not > > > > > > > get > > > > > > > freed right away too but it will be freed eventually when all > > > > > > > subpages are > > > > > > > unmapped, but shmem THP would still stay in page cache. > > > > > > > > > > > > > > To protect the usecases which may prefer preserving THP, introduce > > > > > > > a > > > > > > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP > > > > > > > is > > > > > > > preferred behavior if truncating partial THP. This mode just makes > > > > > > > sense to tmpfs for the time being. > > Sorry, I haven't managed to set aside enough time for this until now. > > First off, let me say that I firmly believe this punch-split behavior > should be the standard behavior (like in my huge tmpfs implementation), > and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it. > But I don't know if I'll be able to persuade Kirill of that. > > If the caller wants to write zeroes into the file, she can do so with the > write syscall: the caller has asked to punch a hole or truncate the file, > and in our case, like your QEMU case, hopes that memory and memcg charge > will be freed by doing so. I'll be surprised if changing the behavior > to yours and mine turns out to introduce a regression, but if it does, > I guess we'll then have to put it behind a sysctl or whatever. > > IIUC the reason that it's currently implemented by clearing the hole > is because split_huge_page() (unlike in older refcounting days) cannot > be guaranteed to succeed. Which is unfortunate, and none of us is very > keen to build a filesystem on unreliable behavior; but the failure cases > appear in practice to be rare enough, that it's on balance better to give > the punch-hole-truncate caller what she asked for whenever possible. I don't have a firm position here. Maybe you are right and we should try to split pages right away. It might be useful to consider case wider than shmem. On traditional filesystem with a backing storage semantics of the same punch hole operation is somewhat different. It doesn't have explicit implications on memory footprint. It's about managing persistent storage. With shmem/tmpfs it is lumped together. It might be nice to write down pages that can be discarded under memory pressure and leave the huge page intact until then... [ I don't see a problem with your patch as long as we agree that it's desired semantics for the interface. ] -- Kirill A. Shutemov