On Aug 12, 2016, at 12:37 PM, Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
>
> Here's stabilized version of my patchset which intended to bring huge pages
> to ext4.
>
> The basics are the same as with tmpfs[1] which is in Linus' tree now and
> ext4 built on top of it. The main difference is that we need to handle
> read out from and write-back to backing storage.
>
> Head page links buffers for whole huge page. Dirty/writeback tracking
> happens on per-hugepage level.
>
> We read out whole huge page at once. It required bumping BIO_MAX_PAGES to
> not less than HPAGE_PMD_NR. I defined BIO_MAX_PAGES to HPAGE_PMD_NR if
> huge pagecache enabled.
>
> On split_huge_page() we need to free buffers before splitting the page.
> Page buffers takes additional pin on the page and can be a vector to mess
> with the page during split. We want to avoid this.
> If try_to_free_buffers() fails, split_huge_page() would return -EBUSY.
>
> Readahead doesn't play with huge pages well: 128k max readahead window,
> assumption on page size, PageReadahead() to track hit/miss. I've got it
> to allocate huge pages, but it doesn't provide any readahead as such.
> I don't know how to do this right. It's not clear at this point if we
> really need readahead with huge pages. I guess it's good enough for now.

Typically read-ahead is a loss if you are able to get large allocations on
disk, since you can get at least seek_rate * chunk_size throughput from the
disks even with random IO at that size. With 1MB allocations and 7200 RPM
drives this works out to be about 150MB/s, which is close to the throughput
of these drives already.

Cheers, Andreas

> Shadow entries ignored on allocation -- recently evicted page is not
> promoted to active list. Not sure if current workingset logic is adequate
> for huge pages. On eviction, we split the huge page and setup 4k shadow
> entries as usual.
>
> Unlike tmpfs, ext4 makes use of tags in radix-tree.
> The approach I used
> for tmpfs -- 512 entries in radix-tree per-hugepages -- doesn't work well
> if we want to have coherent view on tags. So the first 8 patches of the
> patchset converts tmpfs to use multi-order entries in radix-tree.
> The same infrastructure used for ext4.
>
> Encryption doesn't handle huge pages yet. To avoid regressions we just
> disable huge pages for the inode if it has EXT4_INODE_ENCRYPT.
>
> With this version I don't see any xfstests regressions with huge pages enabled.
> Patch with new configurations for xfstests-bld is below.
>
> Tested with 4k, 1k, encryption and bigalloc. All with and without
> huge=always. I think it's reasonable coverage.
>
> The patchset is also in git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugeext4/v2
>
> Please review and consider applying.
>
> [1] http://lkml.kernel.org/r/1465222029-45942-1-git-send-email-kirill.shutemov@xxxxxxxxxxxxxxx
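
For reference, a back-of-envelope sketch of the seek_rate * chunk_size estimate from the reply above. The seek rate of 150 random IOs per second is an assumption: it is not stated in the thread and was picked only because it reproduces the quoted ~150MB/s at 1MB chunks. The function name and the extra chunk sizes are illustrative; 128k is the max readahead window mentioned in the cover letter, and 2MB is the PMD huge page size on x86-64 with 4k base pages.

```python
# Rough estimate of random-IO throughput as seek_rate * chunk_size.
# ASSUMPTION: ~150 random IOs per second for a 7200 RPM drive. This number
# is not from the thread; it is chosen to match the quoted ~150MB/s figure.
# The estimate does not model transfer time or caching.

KB = 1024
MB = 1024 * KB

ASSUMED_SEEKS_PER_SEC = 150  # hypothetical value, see note above


def random_io_throughput(seeks_per_sec: float, chunk_bytes: int) -> float:
    """Return estimated throughput in bytes/s for random IO of a given chunk size."""
    return seeks_per_sec * chunk_bytes


if __name__ == "__main__":
    # 4k base page, 128k readahead window, 1MB allocation, 2MB huge page
    for chunk in (4 * KB, 128 * KB, 1 * MB, 2 * MB):
        mb_per_sec = random_io_throughput(ASSUMED_SEEKS_PER_SEC, chunk) / MB
        print(f"{chunk // KB:5d}k chunks: {mb_per_sec:8.2f} MB/s")
```

Under that assumed seek rate, throughput scales linearly with chunk size, which is the point of the estimate: once on-disk allocations are large, random IO at that granularity is already near the drive's streaming rate.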