On 2021/10/22 17:54, Gao Xiang wrote:
On Fri, Oct 22, 2021 at 05:39:29PM +0800, Gao Xiang wrote:
Hi Qu,
On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
On 2021/10/22 17:11, Gao Xiang wrote:
On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
On Thu 21-10-21 21:04:45, Phillip Susi wrote:
Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
As far as I can tell, the following filesystems support compressed data:
bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
I'd like to make it easier and more efficient for filesystems to
implement compressed data. There are a lot of approaches in use today,
but none of them seem quite right to me. I'm going to lay out a few
design considerations next and then propose a solution. Feel free to
tell me I've got the constraints wrong, or suggest alternative solutions.
When we call ->readahead from the VFS, the VFS has decided which pages
are going to be the most useful to bring in, but it doesn't know how
pages are bundled together into blocks. As I've learned from talking to
Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
something we can teach the VFS.
We (David) added readahead_expand() recently to let the filesystem
opportunistically add pages to the page cache "around" the area requested
by the VFS. That reduces the number of times the filesystem has to
decompress the same block. But it can fail (due to memory allocation
failures or pages already being present in the cache). So filesystems
still have to implement some kind of fallback.
Wouldn't it be better to keep the *compressed* data in the cache and
decompress it multiple times if needed rather than decompress it once
and cache the decompressed data? You would use more CPU time
decompressing multiple times, but be able to cache more data and avoid
more disk IO, which is generally far slower than the CPU can decompress
the data.
Well, one of the problems with keeping compressed data is that for mmap(2)
you have to have pages decompressed so that CPU can access them. So keeping
compressed data in the page cache would add a bunch of complexity. That
being said keeping compressed data cached somewhere else than in the page
cache may certainly be worth it, and then just filling the page cache on demand
from this data...
It can be cached with a special internal inode, so there is no need to
take care of memory reclaim or migration yourself.
There is another problem for btrfs (and maybe other fses).
Btrfs can refer to part of the uncompressed data extent.
Thus we could have the following cases:
0        4K       8K       12K
|        |        |        |
         |        \- 4K of a 128K compressed extent,
         |           compressed size is 16K
         \- 4K of a 64K compressed extent,
            compressed size is 8K
Thanks for this, but the diagram is broken on my side.
Could you help fix this?
Ok, I understand it. I think this is really a strategy problem
beyond CoW: since only 2*4K is needed, you could
1) cache the whole compressed extent and hope it can be accessed
later, so no I/O at all later;
2) not cache such incomplete compressed extents;
3) add some trace records and apply a finer-grained strategy.
Yeah, this should be determined by each fs, like whether it wants to
cache compressed extents at all, and under which conditions to cache
them.
(e.g. for btrfs, if we find the range we want is even smaller than the
compressed size, we can skip such caching)
Thus I don't think there would be a silver bullet for such cases.
In the above case, we really only need 8K of page cache, but if we're
caching the compressed extents, they will take an extra 24K.
Apart from that, just a wild guess, but could we cache whatever the
real I/O reads rather than the whole compressed extent unconditionally?
If the whole compressed extent is needed later, we could check whether
it's all available in the cache, and read the rest if not.
Also, I think there is no need to cache uncompressed CoW data...
Yeah, that's definitely the case, as page cache is already doing the
work for us.
Thanks,
Qu
Thanks,
Gao Xiang
It's definitely not really worth it for this particular corner case.
But it would be pretty common for btrfs, as CoW on compressed data can
easily lead to the above cases.
Thanks,
Qu
Otherwise, these all need to be taken care of. For fixed-sized input
compression, since compressed data is reclaimed in page units, it won't
be very friendly, as such data is all coupled together. But for
fixed-sized output compression, it's quite natural.
Thanks,
Gao Xiang
Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR