On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> On Fri, 4 Mar 2016, Dave Hansen wrote:
> > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > >> Truncate and punch hole that only cover part of THP range is implemented
> > >> by zeroing out this part of THP.
> > >>
> > >> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > >> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
> > >> inconsistent results depending on what pages happened to be allocated.
> > >> Not sure if it should be considered an ABI break or not.
> > >
> > > Looks like this shouldn't be a problem. man 2 fallocate:
> > >
> > > 	Within the specified range, partial filesystem blocks are zeroed,
> > > 	and whole filesystem blocks are removed from the file.  After a
> > > 	successful call, subsequent reads from this range will return
> > > 	zeroes.
> > >
> > > It means we effectively have 2M filesystem block size.
> >
> > The question is still whether this will cause problems for apps.
> >
> > Isn't 2MB a quite unusual block size?  Wouldn't some files on a tmpfs
> > filesystem act like they have a 2M blocksize and others like they have
> > 4k?  Would that confuse apps?
>
> At risk of addressing the tip of an iceberg, before diving down to
> scope out the rest of the iceberg...
>
> So far as the behaviour of lseek(,,SEEK_HOLE) goes, I agree with Kirill:
> I don't think it matters to anyone if it skips some zeroed small pages
> within a hugepage.  It may cause some artificial tests of holepunch and
> SEEK_HOLE to fail, and it ought to be documented as a limitation from
> choosing to enable THP (Kirill's way) on a filesystem, but I don't think
> it's an ABI break to worry about: anyone who cares just shouldn't enable.
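
For reference, a minimal sketch of the kind of artificial holepunch plus
SEEK_HOLE test mentioned above. The helper name punch_and_probe() and its
parameters are made up for illustration and are not taken from any existing
test suite. It assumes Linux with glibc, and a filesystem that supports
FALLOC_FL_PUNCH_HOLE. It checks the one thing fallocate(2) promises (the
punched range reads back as zeroes) and merely reports what SEEK_HOLE sees.

```c
#define _GNU_SOURCE          /* for fallocate(), FALLOC_FL_*, SEEK_HOLE */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only).
 *
 * Fill `path` with `size` bytes of non-zero data, punch a hole at
 * [off, off + len), then report:
 *   *first_hole : result of lseek(fd, 0, SEEK_HOLE) after the punch
 *   *zeroed     : 1 if the punched range reads back as all zeroes
 *
 * Caller must ensure off + len <= size. The file is removed on exit.
 * Returns 0 on success, -1 if any step fails (for example when the
 * filesystem does not support FALLOC_FL_PUNCH_HOLE).
 */
int punch_and_probe(const char *path, off_t size, off_t off, off_t len,
                    off_t *first_hole, int *zeroed)
{
    int ret = -1;
    char *buf = NULL;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    buf = malloc(size);
    if (!buf)
        goto out;
    memset(buf, 0xAA, size);
    if (pwrite(fd, buf, size, 0) != (ssize_t)size)
        goto out;

    /* FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len))
        goto out;

    /* Promised by fallocate(2): the range now reads as zeroes. */
    if (pread(fd, buf, len, off) != (ssize_t)len)
        goto out;
    *zeroed = 1;
    for (off_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            *zeroed = 0;
            break;
        }
    }

    /* Not promised: whether the zeroed range is reported as a hole. */
    *first_hole = lseek(fd, 0, SEEK_HOLE);
    if (*first_hole < 0)
        goto out;

    ret = 0;
out:
    free(buf);
    close(fd);
    unlink(path);
    return ret;
}
```

Called as punch_and_probe(path, 16384, 4096, 4096, &hole, &zeroed), I would
expect zeroed == 1 in every case. A filesystem that really frees the 4k block
would presumably report hole == 4096, while one that only zeroes the range
inside a larger page may report just the implicit hole at end of file (16384).
That difference is the inconsistency under discussion.
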
>
> (Though in the case of my huge tmpfs, it's the reverse: the small hole
> punch splits the hugepage; but it's natural that Kirill's way would try
> to hold on to its compound pages for longer than I do, and that's fine
> so long as it's all consistent.)
>
> But I may disagree with "we effectively have 2M filesystem block size",
> beyond the SEEK_HOLE case.  If we're emulating hugetlbfs in tmpfs, sure,
> we would have 2M filesystem block size.  But if we're enabling THP
> (emphasis on T for Transparent) in tmpfs (or another filesystem), then
> when it matters it must act as if the block size is the 4k (or whatever)
> it usually is.  When it matters?  Approaching memcg limit or ENOSPC
> spring to mind.
>
> Ah, but suppose someone holepunches out most of each 2M page: they would
> expect the memcg not to be charged for those holes (just as when they
> munmap most of an anonymous THP) - that does suggest splitting is needed.

Hmm.. As split_huge_pages() can fail, we would need to propagate this error
to userspace. This potentially triggers some other user-visible effect.
EBUSY is not on the list of fallocate(2) error codes.

I think we can invent a way to track if a THP has punch-holed subpages and
prevent the compound page from being mapped as PMD or mapping these
subpages. But I'm reluctant to do it upfront until real users emerge.

I would propose to see what the user demands will be. Maybe we are
overthinking the situation.

-- 
 Kirill A. Shutemov