Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:

> > +struct ceph_databuf {
> > +        struct bio_vec  *bvec;          /* List of pages */
>
> So, maybe we need to think about folios now?

Yeah, I know... but struct bio_vec has a page pointer and may point to
non-folio type pages.  This stuff is still undergoing evolution as Willy
works on reducing struct page.

What I'm pondering is changing struct folio_queue to take a list of
{ folio, offset, len } rather than using a folio_batch with a simple list
of folios.  It doesn't necessarily help with DIO, though; there we're
given an iterator that we're required to use.

One of the things I'd like to look at for ceph as well is using the page
frag allocator[*] to get small pieces of memory for stashing protocol
data in rather than allocating full-page buffers.

[*] Memory allocated from the page frag allocator can be used with
    MSG_SPLICE_PAGES as its lifetime is controlled by the refcount.

Now, we could probably have a page frag allocator that uses folios rather
than non-folio pages for network filesystem use.  That could be of use to
afs and cifs also.

As I mentioned in a previous reply, how to better integrate folioq/bvec
is hopefully up for discussion at LSF/MM next week.
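To make that a bit more concrete, here's the sort of thing I have in
mind.  This is purely a sketch: ceph_databuf_stash() and the static frag
cache are made-up names, and the error handling and cache lifetime
management are pared down:

	#include <linux/gfp.h>
	#include <linux/mm.h>
	#include <linux/string.h>

	/* Made-up cache; in practice this would live in the ceph client
	 * or the messenger.
	 */
	static struct page_frag_cache ceph_frag_cache;

	static void *ceph_databuf_stash(struct ceph_databuf *dbuf,
					const void *data, size_t len,
					gfp_t gfp)
	{
		void *p;

		/* Carve a small fragment out of a shared page rather
		 * than burning a whole page on a few bytes of protocol
		 * data.
		 */
		p = page_frag_alloc(&ceph_frag_cache, len, gfp);
		if (!p)
			return NULL;
		memcpy(p, data, len);

		/* The fragment holds a ref on its backing page, so the
		 * page can be spliced into the socket with
		 * MSG_SPLICE_PAGES and released with page_frag_free()
		 * once the message has been sent.
		 */
		ceph_databuf_append_page(dbuf, virt_to_page(p),
					 offset_in_page(p), len);
		return p;
	}

The nice property is that the lifetime is purely refcount-driven, which
is exactly what MSG_SPLICE_PAGES wants.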
> > +static inline void ceph_databuf_append_page(struct ceph_databuf *dbuf,
> > +                                            struct page *page,
> > +                                            unsigned int offset,
> > +                                            unsigned int len)
> > +{
> > +        BUG_ON(dbuf->nr_bvec >= dbuf->max_bvec);
> > +        bvec_set_page(&dbuf->bvec[dbuf->nr_bvec++], page, len, offset);
> > +        dbuf->iter.count += len;
> > +        dbuf->iter.nr_segs++;
>
> Why do we assign len to dbuf->iter.count but only increment
> dbuf->iter.nr_segs?

Um, because it doesn't?  It adds len to dbuf->iter.count.

> >  enum ceph_msg_data_type {
> >          CEPH_MSG_DATA_NONE,     /* message contains no data payload */
> > +        CEPH_MSG_DATA_DATABUF,  /* data source/destination is a data buffer */
> >          CEPH_MSG_DATA_PAGES,    /* data source/destination is a page array */
> >          CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
>
> So, the final replacement on databuf will be in the future?

The result of each patch has to compile and work, right?  But yes,
several of the patches in this series reduce the use of those other data
types.  I have patches in progress to finally remove PAGES and PAGELIST,
but they're not quite compiling yet.

> > +        dbuf = kzalloc(sizeof(*dbuf), gfp);
> > +        if (!dbuf)
> > +                return NULL;
>
> I am guessing... Should we return an error code here?

The only error this function can return is ENOMEM, so it just returns
NULL like many other alloc functions.

> > +        } else if (min_bvec) {
> > +                min_bvec = umax(min_bvec, 16);
>
> Why 16 here? Maybe we need to introduce some well-explained constant?

Fair point.

> > +        dbuf->max_bvec = min_bvec;
>
> Why do we assign min_bvec to max_bvec? I am simply slightly confused as
> to why the argument of the function is named min_bvec, but finally we
> are saving the min_bvec value into max_bvec.

The 'min_bvec' argument is the minimum number of bvecs that the caller
needs to be allocated.  This may get rounded up to use all of the piece
of memory we're going to be given by the slab.  'dbuf->max_bvec' is the
maximum number of entries that can be used in dbuf->bvec[] and is a
property of the databuf object.

> > +struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
>
> I see the point here. But do we really need to return a pointer?
> Why not simply:
>
> void ceph_databuf_get(struct ceph_databuf *dbuf)

It means I can do:

	foo->databuf = ceph_databuf_get(dbuf);

rather than:

	ceph_databuf_get(dbuf);
	foo->databuf = dbuf;

> > +static int ceph_databuf_expand(struct ceph_databuf *dbuf, size_t req_bvec,
> > +                               gfp_t gfp)
> > +{
> > +        struct bio_vec *bvec = dbuf->bvec, *old = bvec;
>
> I think that assigning (*old = bvec) looks confusing if we keep it on
> the same line as the bvec declaration and initialization. Why not
> declare and initialize it on the next line?
>
> > +        size_t size, max_bvec, off = dbuf->iter.bvec - old;
>
> I think that's too many declarations on the same line. Why not:
>
> size_t size, max_bvec;
> size_t off = dbuf->iter.bvec - old;

A matter of personal preference, I guess.

> > +        bvec = dbuf->bvec;
> > +        while (dbuf->nr_bvec < req_bvec) {
> > +                struct page *pages[16];
>
> Why do we hardcode 16 here instead of using some well-defined constant?

Because this is only about stack usage.  alloc_pages_bulk() takes a
straight array of page pointers; we have a bvec[], so we need an
intermediate array.  Now, I could actually just overlay the array over
the tail of the bvec[] and do a single bulk allocation since
sizeof(struct page *) < sizeof(struct bio_vec).

> And, again, why not folio?

I don't think there's a bulk folio allocator.  Quite possibly there
*should* be so that readahead can use it - one that allocates different
sizes of folios to fill the space required.

> > +                size_t want = min(req_bvec, ARRAY_SIZE(pages)), got;
> > +
> > +                memset(pages, 0, sizeof(pages));
> > +                got = alloc_pages_bulk(gfp, want, pages);
> > +                if (!got)
> > +                        return -ENOMEM;
> > +                for (i = 0; i < got; i++)
>
> Why do we use size_t for i and got? Why not int, for example?

alloc_pages_bulk() doesn't return an int.  Now, one could legitimately
argue that I should use "unsigned long" rather than "size_t", but I
wouldn't use int here.  int is smaller and signed.  Granted, it's
unlikely we'll be asked for >2G pages, but if we're going to assign it
down to an int, it probably needs to be checked first.

> > +                        bvec_set_page(&bvec[dbuf->nr_bvec + i], pages[i],
> > +                                      PAGE_SIZE, 0);
> > +                dbuf->iter.nr_segs += got;
> > +                dbuf->nr_bvec += got;
>
> If I understood correctly, ceph_databuf_append_page() uses slightly
> different logic. Can you elaborate?
>
> +        dbuf->iter.count += len;
> +        dbuf->iter.nr_segs++;
>
> But here we assign the number of allocated pages to nr_segs. It is
> slightly confusing. I think I am missing something here.

Um - it's still an increment, just by more than one?
ceph_databuf_append_page() adds a single page, so it bumps nr_segs by
one; here we've just added 'got' pages, so nr_segs goes up by 'got'.

I think part of the problem might be that we're mixing two things within
the same container: Partial pages that get kmapped and accessed directly
(e.g. protocol bits) and pages that get accessed indirectly (e.g. data
buffers).  Maybe this needs to be made more explicit in the API.

David
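P.S. To put a strawman behind that last point - the names are invented
and this isn't something I've implemented - making the distinction
explicit in the API might look like:

	/* Allocate a len-byte fragment inside the databuf and return
	 * the kernel mapping of it so that the caller can write
	 * protocol bits into it directly.
	 */
	void *ceph_databuf_append_frag(struct ceph_databuf *dbuf,
				       unsigned int len, gfp_t gfp);

	/* Attach a page of data that will only ever be accessed
	 * through dbuf->iter (e.g. spliced into the socket), never
	 * kmapped by the databuf's owner.
	 */
	void ceph_databuf_append_data_page(struct ceph_databuf *dbuf,
					   struct page *page,
					   unsigned int offset,
					   unsigned int len);

The frag side could then be backed by the page frag allocator mentioned
above.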