Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:

> > +struct ceph_databuf {
> > +        struct bio_vec  *bvec;          /* List of pages */
>
> So, maybe we need to think about folios now?

Yeah, I know... but struct bio_vec has a page pointer and may point to
non-folio type pages.  This stuff is still undergoing evolution as Willy
works on reducing struct page.

What I'm pondering is changing struct folio_queue to take a list of
{ folio, offset, len } rather than using a folio_batch with a simple list
of folios.  It doesn't necessarily help with DIO, though; there we're
given an iterator that we're required to use.

One of the things I'd like to look at for ceph as well is using the page
frag allocator[*] to get small pieces of memory for stashing protocol
data in rather than allocating full-page buffers.

[*] Memory allocated from the page frag allocator can be used with
    MSG_SPLICE_PAGES as its lifetime is controlled by the refcount.

Now, we could probably have a page frag allocator that uses folios rather
than non-folio pages for network filesystem use.  That could be of use to
afs and cifs also.

As I mentioned in a previous reply, how to better integrate folioq/bvec
is hopefully up for discussion at LSF/MM next week.
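To make that a bit more concrete, here's the sort of thing I have in
mind.  This is purely a sketch: ceph_databuf_stash() and the static frag
cache are made-up names, and the error handling and cache lifetime
management are pared down:

	#include <linux/gfp.h>
	#include <linux/mm.h>
	#include <linux/string.h>

	/* Made-up cache; in practice this would live in the ceph client
	 * or the messenger.
	 */
	static struct page_frag_cache ceph_frag_cache;

	static void *ceph_databuf_stash(struct ceph_databuf *dbuf,
					const void *data, size_t len,
					gfp_t gfp)
	{
		void *p;

		/* Carve a small fragment out of a shared page rather
		 * than burning a whole page on a few bytes of protocol
		 * data.
		 */
		p = page_frag_alloc(&ceph_frag_cache, len, gfp);
		if (!p)
			return NULL;
		memcpy(p, data, len);

		/* The fragment holds a ref on its backing page, so the
		 * page can be spliced into the socket with
		 * MSG_SPLICE_PAGES and released with page_frag_free()
		 * once the message has been sent.
		 */
		ceph_databuf_append_page(dbuf, virt_to_page(p),
					 offset_in_page(p), len);
		return p;
	}

The nice property is that the lifetime is purely refcount-driven, which
is exactly what MSG_SPLICE_PAGES wants.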
> > +static inline void ceph_databuf_append_page(struct ceph_databuf *dbuf,
> > +                                            struct page *page,
> > +                                            unsigned int offset,
> > +                                            unsigned int len)
> > +{
> > +        BUG_ON(dbuf->nr_bvec >= dbuf->max_bvec);
> > +        bvec_set_page(&dbuf->bvec[dbuf->nr_bvec++], page, len, offset);
> > +        dbuf->iter.count += len;
> > +        dbuf->iter.nr_segs++;
>
> Why do we assign len to dbuf->iter.count but only increment
> dbuf->iter.nr_segs?

Um, because it doesn't?  It adds len to dbuf->iter.count.

> >  enum ceph_msg_data_type {
> >          CEPH_MSG_DATA_NONE,     /* message contains no data payload */
> > +        CEPH_MSG_DATA_DATABUF,  /* data source/destination is a data buffer */
> >          CEPH_MSG_DATA_PAGES,    /* data source/destination is a page array */
> >          CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
>
> So, the final replacement on databuf will be in the future?

The result of each patch has to compile and work, right?  But yes,
several of the patches in this series reduce the use of those other data
types.  I have patches in progress to finally remove PAGES and PAGELIST,
but they're not quite compiling yet.

> > +        dbuf = kzalloc(sizeof(*dbuf), gfp);
> > +        if (!dbuf)
> > +                return NULL;
>
> I am guessing... Should we return an error code here?

The only error this function can return is ENOMEM, so it just returns
NULL like many other alloc functions.

> > +        } else if (min_bvec) {
> > +                min_bvec = umax(min_bvec, 16);
>
> Why 16 here? Maybe we need to introduce some well-explained constant?

Fair point.

> > +        dbuf->max_bvec = min_bvec;
>
> Why do we assign min_bvec to max_bvec? I am simply slightly confused as
> to why the argument of the function is named min_bvec, but finally we
> are saving the min_bvec value into max_bvec.

The 'min_bvec' argument is the minimum number of bvecs that the caller
needs to be allocated.  This may get rounded up to use all of the piece
of memory we're going to be given by the slab.  'dbuf->max_bvec' is the
maximum number of entries that can be used in dbuf->bvec[] and is a
property of the databuf object.

> > +struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
>
> I see the point here. But do we really need to return a pointer?
> Why not simply:
>
> void ceph_databuf_get(struct ceph_databuf *dbuf)

It means I can do:

	foo->databuf = ceph_databuf_get(dbuf);

rather than:

	ceph_databuf_get(dbuf);
	foo->databuf = dbuf;

> > +static int ceph_databuf_expand(struct ceph_databuf *dbuf, size_t req_bvec,
> > +                               gfp_t gfp)
> > +{
> > +        struct bio_vec *bvec = dbuf->bvec, *old = bvec;
>
> I think that assigning (*old = bvec) looks confusing if we keep it on
> the same line as the bvec declaration and initialization. Why not
> declare and initialize it on the next line?
>
> > +        size_t size, max_bvec, off = dbuf->iter.bvec - old;
>
> I think that's too many declarations on the same line. Why not:
>
> size_t size, max_bvec;
> size_t off = dbuf->iter.bvec - old;

A matter of personal preference, I guess.

> > +        bvec = dbuf->bvec;
> > +        while (dbuf->nr_bvec < req_bvec) {
> > +                struct page *pages[16];
>
> Why do we hardcode 16 here instead of using some well-defined constant?

Because this is only about stack usage.  alloc_pages_bulk() takes a
straight array of page pointers; we have a bvec[], so we need an
intermediate array.  Now, I could actually just overlay the array over
the tail of the bvec[] and do a single bulk allocation since
sizeof(struct page *) < sizeof(struct bio_vec).

> And, again, why not folio?

I don't think there's a bulk folio allocator.  Quite possibly there
*should* be so that readahead can use it - one that allocates different
sizes of folios to fill the space required.

> > +                size_t want = min(req_bvec, ARRAY_SIZE(pages)), got;
> > +
> > +                memset(pages, 0, sizeof(pages));
> > +                got = alloc_pages_bulk(gfp, want, pages);
> > +                if (!got)
> > +                        return -ENOMEM;
> > +                for (i = 0; i < got; i++)
>
> Why do we use size_t for i and got? Why not int, for example?

alloc_pages_bulk() doesn't return an int.  Now, one could legitimately
argue that I should use "unsigned long" rather than "size_t", but I
wouldn't use int here.  int is smaller and signed.  Granted, it's
unlikely we'll be asked for >2G pages, but if we're going to assign it
down to an int, it probably needs to be checked first.

> > +                        bvec_set_page(&bvec[dbuf->nr_bvec + i], pages[i],
> > +                                      PAGE_SIZE, 0);
> > +                dbuf->iter.nr_segs += got;
> > +                dbuf->nr_bvec += got;
>
> If I understood correctly, ceph_databuf_append_page() uses slightly
> different logic. Can you elaborate?
>
> +        dbuf->iter.count += len;
> +        dbuf->iter.nr_segs++;
>
> But here we assign the number of allocated pages to nr_segs. It is
> slightly confusing. I think I am missing something here.

Um - it's still an increment, just by more than one?
ceph_databuf_append_page() adds a single page, so it bumps nr_segs by
one; here we've just added 'got' pages, so nr_segs goes up by 'got'.

I think part of the problem might be that we're mixing two things within
the same container: Partial pages that get kmapped and accessed directly
(e.g. protocol bits) and pages that get accessed indirectly (e.g. data
buffers).  Maybe this needs to be made more explicit in the API.

David
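P.S. To put a strawman behind that last point - the names are invented
and this isn't something I've implemented - making the distinction
explicit in the API might look like:

	/* Allocate a len-byte fragment inside the databuf and return
	 * the kernel mapping of it so that the caller can write
	 * protocol bits into it directly.
	 */
	void *ceph_databuf_append_frag(struct ceph_databuf *dbuf,
				       unsigned int len, gfp_t gfp);

	/* Attach a page of data that will only ever be accessed
	 * through dbuf->iter (e.g. spliced into the socket), never
	 * kmapped by the databuf's owner.
	 */
	void ceph_databuf_append_data_page(struct ceph_databuf *dbuf,
					   struct page *page,
					   unsigned int offset,
					   unsigned int len);

The frag side could then be backed by the page frag allocator mentioned
above.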