Re: [RFC PATCH 0/9] shmem: fix llseek in hugepages

Daniel Gomez <da.gomez@xxxxxxxxxxx> · Tue, 27 Feb 2024 11:42:01 +0000

On Tue, Feb 20, 2024 at 01:39:05PM +0100, Jan Kara wrote:
> On Tue 20-02-24 10:26:48, Daniel Gomez wrote:
> > On Mon, Feb 19, 2024 at 02:15:47AM -0800, Hugh Dickins wrote:
> > I'm uncertain when we may want to be more elastic. In the case of XFS with iomap
> > and support for large folios, for instance, we are 'less' elastic than here. So,
> > what exactly is the rationale behind wanting shmem to be 'more elastic'?
> 
> Well, but if you allocated space in larger chunks - as is the case with
> ext4 and bigalloc feature, you will be similarly 'elastic' as tmpfs with
> large folio support... So simply the granularity of allocation of
> underlying space is what matters here. And for tmpfs the underlying space
> happens to be the page cache.

But it seems like the underlying space 'behaves' differently for large folios
and for huge pages. Is that correct? The difference is reflected in the
st_blksize reported by fstat. With large folios, st_blksize is always based on
the host base page size, regardless of the folio order we get. With huge pages,
it is always based on the configured host huge page size (so far I've tested
2MiB and 1GiB on x86-64, and 2MiB, 512MiB and 16GiB on ARM64).
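
For anyone who wants to check this from userspace, here is a minimal sketch
that reads the value through fstat. The helper name report_blksize is only
illustrative, and the file to inspect (for example one on a tmpfs mount) is
left to the caller; nothing here is part of the series.

```python
import os

def report_blksize(path):
    """Return (st_size, st_blksize) for path, as reported by fstat()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        st = os.fstat(fd)
        # st_blksize is the "preferred I/O size" the filesystem advertises.
        return st.st_size, st.st_blksize
    finally:
        os.close(fd)
```

Calling it on the same file under a huge=never and a huge=always tmpfs mount
should make the difference described above visible.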

If that is the case, I'd agree this is not needed for huge pages, only once we
adopt large folios. Without it, we won't have a way to determine the
step/granularity for seeking data/holes, as a folio could be anything from
order-0 to order-9. Note: order-1 support is currently proposed in the LBS v1
thread [1].
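
To make the granularity question concrete, here is a rough userspace sketch
that walks a file with SEEK_DATA/SEEK_HOLE and lists the ranges reported as
data. The helper name data_extents is hypothetical; the step at which the
reported ranges start and end is whatever the filesystem exposes, which is
exactly the value in question here.

```python
import errno
import os

def data_extents(path):
    """Return a list of (start, end) byte ranges that lseek() reports as data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as err:
                if err.errno == errno.ENXIO:  # no data at or after offset
                    break
                raise
            # There is always an implicit hole at EOF, so this cannot fail
            # for a start offset that lies inside the file.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

On a sparse file, the boundaries of the returned ranges show the granularity
at which the filesystem distinguishes data from holes.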

Regarding large folios adoption, the following implementations [2] have been
sent to the mailing list. Would it make sense, then, to have this block tracking
for the large folios case? Note that my latest attempt includes a partial
implementation of the block tracking discussed here.

[1] https://lore.kernel.org/all/20240226094936.2677493-2-kernel@xxxxxxxxxxxxxxxx/

[2] shmem: high order folios support in write path
v1: https://lore.kernel.org/all/20230915095042.1320180-1-da.gomez@xxxxxxxxxxx/
v2: https://lore.kernel.org/all/20230919135536.2165715-1-da.gomez@xxxxxxxxxxx/
v3 (RFC): https://lore.kernel.org/all/20231028211518.3424020-1-da.gomez@xxxxxxxxxxx/

> 
> > If we ever move shmem to large folios [1], and we use them in an opportunistic way,
> > then we are going to be more elastic in the default path.
> > 
> > [1] https://lore.kernel.org/all/20230919135536.2165715-1-da.gomez@xxxxxxxxxxx
> > 
> > In addition, I think that having this block granularity can benefit quota
> > support and the reclaim path. For example, in the generic/100 fstest,
> > ~26M of data is reported as 1G of used disk when using tmpfs with huge pages.
> 
> And I'd argue this is a desirable thing. If 1G worth of pages is attached
> to the inode, then quota should be accounting 1G usage even though you've
> written just 26MB of data to the file. Quota is about constraining used
> resources, not about "how much did I write to the file".

But these are two separate values. I get that the system wants to track how many
pages are attached to the inode, but is there a way to additionally report how
much of those pages is actually in use?
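
For context, a small sketch of the two values userspace can already see today:
the logical file size (st_size) and the space attached to the inode
(st_blocks, which Linux reports in 512-byte units). The helper name
usage_report is hypothetical. Neither value says how much of the attached
pages holds written data, which is the gap I am asking about.

```python
import os

def usage_report(path):
    """Return (logical_bytes, allocated_bytes) for path."""
    st = os.stat(path)
    # Assumption: Linux semantics, where st_blocks counts 512-byte units.
    allocated = st.st_blocks * 512
    return st.st_size, allocated
```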

> 
> 								Honza
> -- 
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR