On Thu, May 25, 2023 at 09:34:29PM -0400, Kent Overstreet wrote:
> On Thu, May 25, 2023 at 05:47:08PM -0700, Darrick J. Wong wrote:
> > +struct xfarray {
> > +	/* Underlying file that backs the array. */
> > +	struct xfile		*xfile;
> > +
> > +	/* Number of array elements. */
> > +	xfarray_idx_t		nr;
> > +
> > +	/* Maximum possible array size. */
> > +	xfarray_idx_t		max_nr;
> > +
> > +	/* Number of unset slots in the array below @nr. */
> > +	uint64_t		unset_slots;
> > +
> > +	/* Size of an array element. */
> > +	size_t			obj_size;
> > +
> > +	/* log2 of array element size, if possible. */
> > +	int			obj_size_log;
> > +};
> > +
> > +int xfarray_create(struct xfs_mount *mp, const char *descr,
> > +		unsigned long long required_capacity, size_t obj_size,
> > +		struct xfarray **arrayp);
> > +void xfarray_destroy(struct xfarray *array);
> > +int xfarray_load(struct xfarray *array, xfarray_idx_t idx, void *ptr);
> > +int xfarray_unset(struct xfarray *array, xfarray_idx_t idx);
> > +int xfarray_store(struct xfarray *array, xfarray_idx_t idx, const void *ptr);
> > +int xfarray_store_anywhere(struct xfarray *array, const void *ptr);
> > +bool xfarray_element_is_null(struct xfarray *array, const void *ptr);
>
> Nice simple external interface...

+1

> Since you're storing fixed size elements, if you wanted to make it
> slicker you could steal the generic-radix tree approach of using a
> wrapper type to make the object size known at compile time, which lets
> you constant propagate through the index -> offset calculations.
>
> But not worth it from a performance POV with the current implementation,
> because...
>
> > +/*
> > + * Read a memory object directly from the xfile's page cache.  Unlike regular
> > + * pread, we return -E2BIG and -EFBIG for reads that are too large or at too
> > + * high an offset, instead of truncating the read.  Otherwise, we return
> > + * bytes read or an error code, like regular pread.
> > + */
> > +ssize_t
> > +xfile_pread(
> > +	struct xfile		*xf,
> > +	void			*buf,
> > +	size_t			count,
> > +	loff_t			pos)
> > +{
> > +	struct inode		*inode = file_inode(xf->file);
> > +	struct address_space	*mapping = inode->i_mapping;
> > +	struct page		*page = NULL;
> > +	ssize_t			read = 0;
> > +	unsigned int		pflags;
> > +	int			error = 0;
> > +
> > +	if (count > MAX_RW_COUNT)
> > +		return -E2BIG;
> > +	if (inode->i_sb->s_maxbytes - pos < count)
> > +		return -EFBIG;
> > +
> > +	trace_xfile_pread(xf, pos, count);
> > +
> > +	pflags = memalloc_nofs_save();
> > +	while (count > 0) {
> > +		void		*p, *kaddr;
> > +		unsigned int	len;
> > +
> > +		len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
> > +
> > +		/*
> > +		 * In-kernel reads of a shmem file cause it to allocate a page
> > +		 * if the mapping shows a hole.  Therefore, if we hit ENOMEM
> > +		 * we can continue by zeroing the caller's buffer.
> > +		 */
> > +		page = shmem_read_mapping_page_gfp(mapping, pos >> PAGE_SHIFT,
> > +				__GFP_NOWARN);
> > +		if (IS_ERR(page)) {
> > +			error = PTR_ERR(page);
> > +			if (error != -ENOMEM)
> > +				break;
> > +
> > +			memset(buf, 0, len);
> > +			goto advance;
> > +		}
> > +
> > +		if (PageUptodate(page)) {
> > +			/*
> > +			 * xfile pages must never be mapped into userspace, so
> > +			 * we skip the dcache flush.
> > +			 */
> > +			kaddr = kmap_local_page(page);
> > +			p = kaddr + offset_in_page(pos);
> > +			memcpy(buf, p, len);
> > +			kunmap_local(kaddr);
> > +		} else {
> > +			memset(buf, 0, len);
> > +		}
> > +		put_page(page);
> > +
> > +advance:
> > +		count -= len;
> > +		pos += len;
> > +		buf += len;
> > +		read += len;
> > +	}
> > +	memalloc_nofs_restore(pflags);
> > +
> > +	if (read > 0)
> > +		return read;
> > +	return error;
> > +}
>
> this all, and the write path, looks a bit heavy - you're calling through
> shmem_read_mapping_page_gfp() on every lookup. Does it matter?
Longer term I'd like to work with willy on an in-kernel mmap and/or using
large folios with the tmpfs file, but for now I only care that it works
correctly and gets merged. :)

> If we care about performance, we want to get it as much as possible down
> to just the page cache radix tree lookup - and possibly cache the last
> page returned if we care about sequential performance.

(That comes later in this megapatchset.)

> OTOH, maybe shmem_get_folio_gfp() and __filemap_get_folio() could
> benefit from some early returns -
>	if (likely(got_the_thing_we_want)) return folio;
>
> Another thought... if obj_size <= PAGE_SIZE, maybe you could do what
> genradix does and not have objects span pages?  That would let you get
> rid of the loop in read/write - but then you'd want to be doing an
> interface that works in terms of pages/folios, which wouldn't be as
> clean as what you've got.

Yeah... the xfs dquot files store 136-byte dquot records which don't
cross fsblock boundaries.  There's a lot of math involved there, though
at least there's an incore dquot object so we're mostly not pounding on
the dquot file itself.

--D

> Just spitballing random ideas, looks good :)