On Tue, Dec 13, 2022 at 06:29:33PM +0100, Andrey Albershteyn wrote:
> Add fs-verity page verification in read IO path. The verification
> itself is offloaded into workqueue (provided by fs-verity).
> 
> The work_struct items are allocated from bioset side by side with
> bio being processed.
> 
> As inodes with fs-verity doesn't use large folios we check only
> first page of the folio for errors (set by fs-verity if verification
> failed).
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@xxxxxxxxxx>
> ---
>  fs/iomap/buffered-io.c | 80 +++++++++++++++++++++++++++++++++++++++---
>  include/linux/iomap.h  |  5 +++
>  2 files changed, 81 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 91ee0b308e13d..b7abc2f806cfc 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -17,6 +17,7 @@
>  #include <linux/bio.h>
>  #include <linux/sched/signal.h>
>  #include <linux/migrate.h>
> +#include <linux/fsverity.h>
>  #include "trace.h"
>  
>  #include "../internal.h"
> @@ -42,6 +43,7 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>  }
>  
>  static struct bio_set iomap_ioend_bioset;
> +static struct bio_set iomap_readend_bioset;
>  
>  static struct iomap_page *
>  iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> @@ -189,9 +191,39 @@ static void iomap_read_end_io(struct bio *bio)
>  	int error = blk_status_to_errno(bio->bi_status);
>  	struct folio_iter fi;
>  
> -	bio_for_each_folio_all(fi, bio)
> +	bio_for_each_folio_all(fi, bio) {
> +		/*
> +		 * As fs-verity doesn't work with multi-page folios, verity
> +		 * inodes have large folios disabled (only single page folios
> +		 * are used)
> +		 */
> +		if (!error)
> +			error = PageError(folio_page(fi.folio, 0));
> +
>  		iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
> +	}
> +
>  	bio_put(bio);
> +	/* The iomap_readend has been freed by bio_put() */
> +}
> +
> +static void iomap_read_work_end_io(
> +	struct work_struct *work)
> +{
> +	struct iomap_readend *ctx =
> +		container_of(work, struct iomap_readend, read_work);
> +	struct bio *bio = &ctx->read_inline_bio;
> +
> +	fsverity_verify_bio(bio);
> +	iomap_read_end_io(bio);
> +}
> +
> +static void iomap_read_work_io(struct bio *bio)
> +{
> +	struct iomap_readend *ctx =
> +		container_of(bio, struct iomap_readend, read_inline_bio);
> +
> +	fsverity_enqueue_verify_work(&ctx->read_work);
> }

Ok, so fsverity simply queues this to a global workqueue it has set
up as WQ_UNBOUND | WQ_HIGHPRI with, effectively, one worker thread
per CPU.

This will work for simple cases and to get the basic infrastructure
in place, but there are problems here that we will need to address.

1. High priority workqueues are used within XFS to ensure that data
IO completion cannot stall processing of journal IO completions. We
use them to prevent IO priority inversion deadlocks. That is,
journal IO completions use a WQ_HIGHPRI workqueue to ensure that
they are scheduled ahead of data IO completions, as data IO
completions may require journal IO to complete before they can make
progress (e.g. to ensure transaction reservations in IO completion
can make progress).

Hence using a WQ_HIGHPRI workqueue directly in the user data IO path
is a potential filesystem livelock/deadlock vector. At best, it's
going to impact on the latency of journal commits and hence anything
that is running data integrity operations whilst fsverity read
operations are in progress.
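To make the inversion concrete - this is a sketch from memory, so
don't take the exact flags as gospel for either tree - fsverity sets
its global queue up along the lines of:

	fsverity_read_workqueue = alloc_workqueue("fsverity_read_queue",
						  WQ_UNBOUND | WQ_HIGHPRI,
						  num_online_cpus());

whereas XFS deliberately keeps WQ_HIGHPRI for the log IO completion
workqueue so that it is always scheduled ahead of data IO
completion:

	mp->m_log_workqueue = alloc_workqueue("xfs-log/%s",
					      WQ_FREEZABLE | WQ_HIGHPRI,
					      0, mp->m_super->s_id);

Feed user data IO completion work into another WQ_HIGHPRI queue and
that ordering guarantee is gone.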
2. The fsverity workqueue is global - it creates a cross-filesystem
contention point. That means a single busy fsverity filesystem will
create unpredictable IO latency for filesystems that only use
fsverity sparingly. i.e. heavy amounts of fsverity work on one
filesystem will impact the performance of journal and other IO
operations on all other active filesystems due to fsverity using
WQ_HIGHPRI.

This sort of workqueue setup is typically fine for a single user
device that has lots of otherwise idle CPU that can be co-opted to
create concurrency in IO completion work where there would otherwise
be none (e.g. an Android phone). However, it is less than ideal for
a highly concurrent application using AIO or io_uring to push a
couple of million IOPS through the filesystem. In these situations,
we don't want IO completion work to be spread out because there is
no idle CPU for it to be run on. Instead, locality is king - we want
it run on the same CPU that already has the inode, bio, and other
objects hot in the cache.

So, really, for XFS we want per-filesystem, locality preserving
per-cpu workqueues for fsverity work, as we get IO concurrency (and
therefore fsverity work concurrency) from the application IO
patterns. And it avoids all potential issues with using WQ_HIGHPRI
for journal IO to avoid data IO completion deadlocks.

>  struct iomap_readpage_ctx {
> @@ -264,6 +296,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  	loff_t orig_pos = pos;
>  	size_t poff, plen;
>  	sector_t sector;
> +	struct iomap_readend *readend;
>  
>  	if (iomap->type == IOMAP_INLINE)
>  		return iomap_read_inline_data(iter, folio);
> @@ -276,7 +309,21 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  
>  	if (iomap_block_needs_zeroing(iter, pos)) {
>  		folio_zero_range(folio, poff, plen);
> -		iomap_set_range_uptodate(folio, iop, poff, plen);
> +		if (!fsverity_active(iter->inode)) {
> +			iomap_set_range_uptodate(folio, iop, poff, plen);
> +			goto done;
> +		}
> +
> +		/*
> +		 * As fs-verity doesn't work with folios sealed inodes have
> +		 * multi-page folios disabled and we can check on first and only
> +		 * page
> +		 */
> +		if (fsverity_verify_page(folio_page(folio, 0)))
> +			iomap_set_range_uptodate(folio, iop, poff, plen);
> +		else
> +			folio_set_error(folio);

This makes me wonder if this should be an iomap->page_ops method
rather than calling fsverity code directly. i.e. the filesystem
knows whether fsverity is enabled on the inode during its
iomap_begin() callout, and that state *must not change* until the IO
is complete. Hence it can set an iomap flag saying
IOMAP_F_READ_VERIFY and add a page_ops vector with a "verify_folio"
callout.
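Something like this in include/linux/iomap.h, purely as a sketch -
the flag bit, method name and signature here are placeholders, not
existing API:

	/* Read data needs post-IO verification (set by ->iomap_begin()) */
	#define IOMAP_F_READ_VERIFY	0x40	/* placeholder bit */

	struct iomap_page_ops {
		/* existing page_prepare()/page_done() methods elided */

		/*
		 * Verify freshly read data in the given folio range.
		 * Returns false if verification failed and the range
		 * must not be marked uptodate.
		 */
		bool (*verify_folio)(struct folio *folio, size_t offset,
				     size_t len);
	};

The filesystem would set IOMAP_F_READ_VERIFY from ->iomap_begin()
when fsverity_active() is true for the inode.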
Then this code can do:

	if (iomap_block_needs_zeroing(iter, pos)) {
		folio_zero_range(folio, poff, plen);
		if (iomap->flags & IOMAP_F_READ_VERIFY) {
			if (!iomap->page_ops->verify_folio(folio, poff, plen)) {
				folio_set_error(folio);
				goto done;
			}
		}
		iomap_set_range_uptodate(folio, iop, poff, plen);
		goto done;
	}

> @@ -297,8 +344,18 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  
>  		if (ctx->rac) /* same as readahead_gfp_mask */
>  			gfp |= __GFP_NORETRY | __GFP_NOWARN;
> -		ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
> +		if (fsverity_active(iter->inode)) {
> +			ctx->bio = bio_alloc_bioset(iomap->bdev,
> +					bio_max_segs(nr_vecs), REQ_OP_READ,
> +					GFP_NOFS, &iomap_readend_bioset);
> +			readend = container_of(ctx->bio,
> +					struct iomap_readend,
> +					read_inline_bio);
> +			INIT_WORK(&readend->read_work, iomap_read_work_end_io);
> +		} else {
> +			ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
>  				REQ_OP_READ, gfp);
> +		}
>  		/*
>  		 * If the bio_alloc fails, try it again for a single page to
>  		 * avoid having to deal with partial page reads. This emulates
> @@ -311,7 +368,11 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  		if (ctx->rac)
>  			ctx->bio->bi_opf |= REQ_RAHEAD;
>  		ctx->bio->bi_iter.bi_sector = sector;
> -		ctx->bio->bi_end_io = iomap_read_end_io;
> +		if (fsverity_active(iter->inode))
> +			ctx->bio->bi_end_io = iomap_read_work_io;
> +		else
> +			ctx->bio->bi_end_io = iomap_read_end_io;

Hmmm. OK, why not just always use the iomap_readend_bioset, put a
flags field in it and then do this check in iomap_read_end_io()?
i.e.

	struct iomap_read_ioend {
		bool			queue_work;
		struct work_struct	work;		/* post read work (fs-verity) */
		struct bio		inline_bio;	/* MUST BE LAST! */
	};

Then always allocate from the iomap_read_ioend_bioset, and do:

	ctx->bio->bi_end_io = iomap_read_end_io;
	if (iomap->flags & IOMAP_F_VERITY) {
		read_ioend->queue_work = true;
		INIT_WORK(&read_ioend->work, iomap_read_ioend_work);
	}

Then in iomap_read_end_io():

	if (ioend->queue_work) {
		fsverity_enqueue_verify_work(&ctx->work);
		return;
	}
	iomap_read_end_io(bio);

And if we put a fs supplied workqueue into the iomap as well, then
we have a mechanism for the filesystem to supply its own workqueue
for fsverity, fscrypt, or any other read-side post-IO data
processing functions filesystems care to add.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx