Re: What are the I/O boundaries for read/write to a ceph object?

David Howells <dhowells@xxxxxxxxxx> · Fri, 14 Mar 2025 14:29:30 +0000

Hi Ilya,

Ilya Dryomov <idryomov@xxxxxxxxx> wrote:

> > Can you tell me what the I/O boundaries are for splitting up a read or a
> > write request into separate subrequests?
> >
> > Does each RPC call need to fit within the bounds of an object or does it
> > need to fit within the bounds of a stripe/block?
>
> Within the bounds of a RADOS object.

Okay, thanks.

> > Can a vectored read/write access multiple objects/blocks?
> 
> I'm not sure what "vectored" means in this context,

I mean that, rather than issuing, say, a read RPC with a single range, I can
give it a list of non-contiguous regions to read.  I might do this, for
example, if the VM issues a readahead request for a non-contiguous set of
folios that fill in the gaps around a folio already present in the pagecache.

> but a single read/write coming from the VFS may need to access multiple
> RADOS objects.  Assuming that the object size is 4M (default), the simplest
> example is a request for 8192 bytes at 4190208 offset in the file.

netfslib allows for a request to be split up into a number of subrequests,
where each subrequest can be of a different size and may access a different
server or fscache.  What I need to make the ->prepare_read() function do is,
for the specified starting point in the given file, return how many bytes we
can possibly read before we have to issue the next subrequest.
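
To make that concrete with your example, here is a rough userspace sketch of
the splitting I have in mind.  It assumes simple (non-fancy) striping, so that
object N holds file bytes [N * object_size, (N + 1) * object_size).  The
names span_to_object_end() and split_at_objects() are made up for
illustration and are not netfslib or libceph functions:

```c
#include <stdint.h>

/*
 * Bytes from file offset 'off' up to the end of the object containing it.
 * Assumption: simple striping, all objects are 'object_size' bytes.
 */
static uint64_t span_to_object_end(uint64_t off, uint64_t object_size)
{
	return object_size - (off % object_size);
}

/*
 * Split [off, off + len) into pieces that each stay within one object.
 * Stores at most 'max' piece lengths and returns how many were stored.
 */
static int split_at_objects(uint64_t off, uint64_t len, uint64_t object_size,
			    uint64_t *pieces, int max)
{
	int n = 0;

	while (len > 0 && n < max) {
		uint64_t span = span_to_object_end(off, object_size);
		uint64_t part = len < span ? len : span;

		pieces[n++] = part;
		off += part;
		len -= part;
	}
	return n;
}
```

Called with off=4190208, len=8192 and object_size=4194304, this yields two
pieces of 4096 bytes each, one on either side of the object boundary.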

I currently have this (note this isn't what is in the patches I posted
yesterday):

	static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq)
	{
		struct netfs_io_request *rreq = subreq->rreq;
		struct ceph_inode_info *ci = ceph_inode(rreq->inode);
		struct ceph_fs_client *fsc =
			ceph_inode_to_fs_client(rreq->inode);
		const struct ceph_file_layout *layout = &ci->i_layout;

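		/* Assumption: the I/O block size is the stripe unit */
		size_t blocksize = layout->stripe_unit;
		/* Offset within the block; the mask assumes a power-of-2 size */
		size_t blockoff = subreq->start & (blocksize - 1);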

		/* Truncate the extent at the end of the current block */
		rreq->io_streams[0].sreq_max_len =
			umin(blocksize - blockoff,
			     fsc->mount_options->rsize);

		return 0;
	}

where "rreq->io_streams[0].sreq_max_len" gets set to the maximum length we can
make the next subrequest.  I've made a number of assumptions here that I don't
know are valid:

 - The I/O block size is the stripe unit size.
 - Blocks are all the same size.
 - Blocks are a power-of-2 size.
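
To show why the last assumption matters, here is a small sketch comparing the
mask form used above with a modulo form.  The helper names are hypothetical,
and the 192K block size mentioned below is chosen purely for illustration;
whether such a layout can actually occur is part of what I'm asking:

```c
#include <stdint.h>

/* Mask form, as in the function above: only valid for power-of-2 sizes. */
static uint64_t span_mask(uint64_t start, uint64_t blocksize)
{
	return blocksize - (start & (blocksize - 1));
}

/* Modulo form: valid for any non-zero block size, at the cost of a division. */
static uint64_t span_mod(uint64_t start, uint64_t blocksize)
{
	return blocksize - (start % blocksize);
}

/* Returns non-zero if x is a power of 2, i.e. if the mask form is safe. */
static int is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}
```

For a 64K block the two forms agree, but for a 192K block at start=200000 the
modulo form gives 193216 bytes to the end of the block whereas the mask form
gives 62144, which would truncate the subrequest in the wrong place.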

> > What I'm trying to do is to avoid using ceph_calc_file_object_mapping() as
> > it does a bunch of 128-bit divisions for which I don't need the answers.
> > I only need xlen - and really, I just need the limits of the read or write
> > I can make.
> 
> I don't think ceph_calc_file_object_mapping() can be avoided in the
> general case.  With non-default ("fancy") striping, given for example
> stripe_unit=64K and stripe_count=5, a single 64K * 6 = 384K request at
> offset 0 in the file would need to access 5 RADOS objects, with the
> first object/RPC delivering 128K and the other four objects/RPCs 64K
> each.
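
To check my understanding of that example, here is a userspace sketch that
walks a request one stripe unit at a time and totals the bytes landing in each
object.  It follows the mapping as I read it (stripe units are dealt
round-robin across stripe_count objects within an object set), it assumes the
4M default object size you mentioned, and bytes_per_object() and MAX_OBJS are
names I've invented for the sketch:

```c
#include <stdint.h>

#define MAX_OBJS 16	/* hypothetical bound, enough for this example */

/*
 * Add up, per object number, the bytes of [off, off + len) that fall in
 * each object.  'su' is the stripe unit and 'sc' the stripe count.
 * The caller must zero per_obj[] first.
 */
static void bytes_per_object(uint64_t off, uint64_t len,
			     uint64_t su, uint64_t sc, uint64_t object_size,
			     uint64_t per_obj[MAX_OBJS])
{
	uint64_t stripes_per_object = object_size / su;

	while (len > 0) {
		uint64_t blockno = off / su;		/* stripe unit index */
		uint64_t blockoff = off % su;		/* offset within it */
		uint64_t stripeno = blockno / sc;
		uint64_t stripepos = blockno % sc;
		uint64_t objsetno = stripeno / stripes_per_object;
		uint64_t objno = objsetno * sc + stripepos;
		uint64_t part = su - blockoff;

		if (part > len)
			part = len;
		if (objno < MAX_OBJS)
			per_obj[objno] += part;
		off += part;
		len -= part;
	}
}
```

With su=64K, sc=5 and a 384K request at offset 0, this gives 128K for object 0
and 64K for each of objects 1 to 4, matching your numbers.  It only counts
bytes per object; it doesn't work out the extents within each object.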

ceph_calc_file_object_mapping() seems to assume that the stripe_unit size and
the object_size are fixed.  Is this something that might change?

Would you object to me putting an additional function in libceph next to that
one that just gets me the span of the block containing the specified file
position?

Thanks,
David