Re: What are the I/O boundaries for read/write to a ceph object?

On Fri, Mar 14, 2025 at 3:29 PM David Howells <dhowells@xxxxxxxxxx> wrote:
>
> Hi Ilya,
>
> Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> > > Can you tell me what the I/O boundaries are for splitting up a read or a
> > > write request into separate subrequests?
> > >
> > > Does each RPC call need to fit within the bounds of an object or does it
> > > need to fit within the bounds of a stripe/block?
> >
> > Within the bounds of a RADOS object.
>
> Okay, thanks.
>
> > > Can a vectored read/write access multiple objects/blocks?
> >
> > I'm not sure what "vectored" means in this context,
>
> Where rather than issuing, say, a read data RPC with a single range to read, I
> can give it a list of non-contiguous regions to read.  I might do this, for
> example, if the VM issues a readahead request for a non-contiguous set of
> folios that fill in the gaps around a folio already present in the pagecache.
>
> > but a single read/write coming from the VFS may need to access multiple
> > RADOS objects.  Assuming that the object size is 4M (default), the simplest
> > example is a request for 8192 bytes at 4190208 offset in the file.
>
> netfslib allows for a request to be split up into a number of subrequests,
> where each subrequest can be of a different size and may access a different
> server or fscache.  What I need to make the ->prepare_read() function do is,
> for the specified starting point in the given file, return how many bytes we
> can possibly read before we have to issue the next subrequest.
>
> I currently have this (note this isn't what is in the patches I posted
> yesterday):
>
>         static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq)
>         {
>                 struct netfs_io_request *rreq = subreq->rreq;
>                 struct ceph_inode_info *ci = ceph_inode(rreq->inode);
>                 struct ceph_fs_client *fsc =
>                         ceph_inode_to_fs_client(rreq->inode);
>                 const struct ceph_file_layout *layout = &ci->i_layout;
>
>                 size_t blocksize = layout->stripe_unit;
>                 size_t blockoff = subreq->start & (blocksize - 1);
>
>                 /* Truncate the extent at the end of the current block */
>                 rreq->io_streams[0].sreq_max_len =
>                         umin(blocksize - blockoff,
>                              fsc->mount_options->rsize);
>
>                 return 0;
>         }
>
> where "rreq->io_streams[0].sreq_max_len" gets set to the maximum length we can
> make the next subrequest.  I've made a number of assumptions here that I don't
> know are valid:
>
>  - The I/O block size is the stripe unit size.

Hi David,

This is valid, but operating purely in terms of stripe units won't be
optimal in the general case.  In the example that I gave in the previous
message, you would end up issuing 6 RPCs instead of 5, not recognizing
that the first and last logical blocks of the original 384K request are
contiguous within the first RADOS object and could be done in one go.

"Fancy" striping isn't widely used though, so if implementing it
optimally complicates things too much, I wouldn't sweat it.
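For reference, here is a minimal userspace sketch of the offset-to-object
math that ceph_calc_file_object_mapping() performs (the names are
illustrative, not the libceph ones).  Running it for the 384K example
shows file blocks 0 and 5 both landing in object 0, at object offsets 0
and 64K, which is why they could be merged into one 128K RPC:

        #include <stdio.h>

        struct mapping {
                unsigned int objno;     /* RADOS object index in the file */
                unsigned int objoff;    /* byte offset within that object */
        };

        static struct mapping map_file_offset(unsigned long long off,
                                              unsigned int stripe_unit,
                                              unsigned int stripe_count,
                                              unsigned int object_size)
        {
                unsigned int su_per_obj = object_size / stripe_unit;
                unsigned long long blockno = off / stripe_unit;
                unsigned int blockoff = off % stripe_unit;
                unsigned long long stripeno = blockno / stripe_count;
                unsigned int stripepos = blockno % stripe_count;
                unsigned long long objsetno = stripeno / su_per_obj;
                unsigned int block_in_obj = stripeno % su_per_obj;
                struct mapping m = {
                        .objno  = objsetno * stripe_count + stripepos,
                        .objoff = block_in_obj * stripe_unit + blockoff,
                };

                return m;
        }

        int main(void)
        {
                /* stripe_unit=64K, stripe_count=5, object_size=4M: walk
                 * the six 64K blocks of a 384K request at offset 0. */
                for (unsigned long long off = 0; off < 6 * 65536; off += 65536) {
                        struct mapping m = map_file_offset(off, 65536, 5,
                                                           4 << 20);
                        printf("file off %7llu -> obj %u off %u\n",
                               off, m.objno, m.objoff);
                }
                return 0;
        }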

>  - Blocks are all the same size.

This is valid (except for the EOF block).

>  - Blocks are a power-of-2 size.

This is NOT valid.  IIRC the constraints are that the stripe unit is
a multiple of 64K and that the object size is a multiple of the stripe
unit.  Technically there is nothing stopping the stripe unit ("block")
from being set to 192K, for example.
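So the "subreq->start & (blocksize - 1)" mask in your snippet would
miscompute blockoff for something like a 192K stripe unit.  A sketch of
a division-based variant (untested, using the usual do_div() idiom for
64-bit-by-32-bit division):

        u64 off = subreq->start;
        u32 blockoff = do_div(off, layout->stripe_unit); /* off is now the quotient */

        /* Truncate the extent at the end of the current stripe unit */
        rreq->io_streams[0].sreq_max_len =
                umin(layout->stripe_unit - blockoff,
                     fsc->mount_options->rsize);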

>
> > > What I'm trying to do is to avoid using ceph_calc_file_object_mapping() as
> > > it does a bunch of 128-bit divisions for which I don't need the answers.
> > > I only need xlen - and really, I just need the limits of the read or write
> > > I can make.
> >
> > I don't think ceph_calc_file_object_mapping() can be avoided in the
> > general case.  With non-default ("fancy") striping, given for example
> > stripe_unit=64K and stripe_count=5, a single 64K * 6 = 384K request at
> > offset 0 in the file would need to access 5 RADOS objects, with the
> > first object/RPC delivering 128K and the other four objects/RPCs 64K
> > each.
>
> ceph_calc_file_object_mapping() seems to assume that the stripe_unit size and
> the object_size are fixed.  Is this something that might change?

Not after the file is created.  Think of these as immutable file
attributes that affect the data placement.

>
> Would you object to me putting an additional function in libceph next to that
> one that just gets me that span of the block containing the specified file
> position?

Fine with me.
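
Something like the below is what I'd imagine (the name and signature are
hypothetical, just to illustrate -- it returns the start and length of
the stripe unit containing the given file offset):

        static void ceph_calc_file_object_span(const struct ceph_file_layout *l,
                                               u64 off, u64 *xoff, u32 *xlen)
        {
                u64 blockno = off;
                u32 blockoff = do_div(blockno, l->stripe_unit);

                *xoff = off - blockoff;            /* start of the containing block */
                *xlen = l->stripe_unit - blockoff; /* bytes left in that block */
        }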

Thanks,

                Ilya
