On 03/16/2018 07:37 AM, Alex Elder wrote: > From: Ilya Dryomov <idryomov@xxxxxxxxx> > > bl, stripeno and objsetno should be u64 -- otherwise large enough files > get corrupted. How large depends on file layout: > > - 4M-objects layout (default): any file over 16P > - 64K-objects layout (smallest possible object size): any file over 512T > > Only CephFS is affected, rbd doesn't use ceph_calc_file_object_mapping() > yet. Fortunately, CephFS has a max_file_size configurable, the default > for which is way below both of the above numbers. > > Reimplement the logic from scratch with no layout validation -- it's > done on the MDS side. > > Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx> OK I have some comments on this. What you've produced is much better than what was there before--you've really simplified it a lot. And it looks correct to me, so I'll mark it: Reviewed-by: Alex Elder <elder@xxxxxxxxxx> But please consider my comments and see if you can improve it a little further. I think this function is a great place to explain a lot of detail about striping. It ends up beiong very simple, which is excellent. > --- > net/ceph/osdmap.c | 88 ++++++++++++++++++------------------------------------- > 1 file changed, 28 insertions(+), 60 deletions(-) > > diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c > index 0da27c66349a..6c1cd4e2f7a7 100644 > --- a/net/ceph/osdmap.c > +++ b/net/ceph/osdmap.c > @@ -2141,72 +2141,40 @@ bool ceph_osds_changed(const struct ceph_osds *old_acting, > } > > /* > - * calculate file layout from given offset, length. > - * fill in correct oid, logical length, and object extent > - * offset, length. > + * Map a file extent to a stripe unit within an object. > + * Fill in objno, offset into object, and object extent length (i.e. the > + * number of bytes mapped, less than or equal to @l->stripe_unit). > * > - * for now, we write only a single su, until we can > - * pass a stride back to the caller. > + * Example for stripe_count = 3, stripes_per_object = 4: > + * > + * blockno | 0 3 6 9 | 1 4 7 10 | 2 5 8 11 | 12 15 18 21 | 13 16 19 The above represents the mapping of data in the original file across objects. The numbers are offsets of "stripe unit" size. They do not correlate to the "blockno" variable defined below. So the name here should probably change, and it may warrant a little more explanation. > + * stripeno | 0 1 2 3 | 0 1 2 3 | 0 1 2 3 | 4 5 6 7 | 4 5 6 > + * stripepos | 0 | 1 | 2 | 0 | 1 > + * objno | 0 | 1 | 2 | 3 | 4 > + * objsetno | 0 | 1 > */ > -int ceph_calc_file_object_mapping(struct ceph_file_layout *layout, > +int ceph_calc_file_object_mapping(struct ceph_file_layout *l, > u64 off, u64 len, > - u64 *ono, > - u64 *oxoff, u64 *oxlen) > + u64 *objno, u64 *objoff, u64 *xlen) > { > - u32 osize = layout->object_size; > - u32 su = layout->stripe_unit; > - u32 sc = layout->stripe_count; > - u32 bl, stripeno, stripepos, objsetno; > - u32 su_per_object; > - u64 t, su_offset; > + u32 stripes_per_object = l->object_size / l->stripe_unit; > + u64 blockno; /* which su */ Maybe /* which su within file */ (or maybe "file" won't be the proper general term any more?). Wouldn't "suno" (or just "su", or maybe "which_su" with no comment) be a better name? > + u32 blockoff; /* offset into su */ You could maybe add a "block_off" line in your helpful picture above. blockoff | 0 1 2 3 | 0 1 2 3 | 0 1 2 3 | 0 1 2 3 | 0 1 2 Somehow it would be nice to have consistency for these remainders. Like maybe "blockpos" rather than "blockoff" to match the rest. Or maybe the others should change, or maybe neither "off" nor "pos" is the right suffix. In any case, what you're doing has such nice uniformity below, e.g.: num = ldiv_rem(val, unit_size, &remainder); it would be nice if all the names followed consistent patterns. If I had a more specific suggestion I'd provide it; hopefully my meaning is clear, and you can consider it. > + u64 stripeno; /* which stripe */ > + u32 stripepos; /* which su in the stripe, > + which object in the object set */ So a "stripe unit" is a fixed number of "blocks". A "stripe" is composed of one "stripe unit" from each of "stripe_count" objects (and those objects together form an "object set"). An object holds (up to) "stripes_per_object" stripe units. That number corresponds also to the number of stripes that span an object set before moving on to the next object set for subsequent data. Is that all correct? I don't know where the terms are formally defined but this function would be as good a place as any in the code to define them. This stuff can be really confusing, so using well-defined terms is important. (I've seen different naming schemes for striping before, and it leads to much confusion.) > + u64 objsetno; /* which object set */ > + u32 objsetpos; /* which stripe in the object set */ > + > + blockno = div_u64_rem(off, l->stripe_unit, &blockoff);> + stripeno = div_u64_rem(blockno, l->stripe_count, &stripepos); > + objsetno = div_u64_rem(stripeno, stripes_per_object, &objsetpos); > + > + *objno = objsetno * l->stripe_count + stripepos; > + *objoff = objsetpos * l->stripe_unit + blockoff; A one-line comment explaining the truncation below wouldn't hurt. Perhaps just at top of the function. -Alex > + *xlen = min_t(u64, len, l->stripe_unit - blockoff); > > - dout("mapping %llu~%llu osize %u fl_su %u\n", off, len, > - osize, su); > - if (su == 0 || sc == 0) > - goto invalid; > - su_per_object = osize / su; > - if (su_per_object == 0) > - goto invalid; > - dout("osize %u / su %u = su_per_object %u\n", osize, su, > - su_per_object); > - > - if ((su & ~PAGE_MASK) != 0) > - goto invalid; > - > - /* bl = *off / su; */ > - t = off; > - do_div(t, su); > - bl = t; > - dout("off %llu / su %u = bl %u\n", off, su, bl); > - > - stripeno = bl / sc; > - stripepos = bl % sc; > - objsetno = stripeno / su_per_object; > - > - *ono = objsetno * sc + stripepos; > - dout("objset %u * sc %u = ono %u\n", objsetno, sc, (unsigned int)*ono); > - > - /* *oxoff = *off % layout->fl_stripe_unit; # offset in su */ > - t = off; > - su_offset = do_div(t, su); > - *oxoff = su_offset + (stripeno % su_per_object) * su; > - > - /* > - * Calculate the length of the extent being written to the selected > - * object. This is the minimum of the full length requested (len) or > - * the remainder of the current stripe being written to. > - */ > - *oxlen = min_t(u64, len, su - su_offset); > - > - dout(" obj extent %llu~%llu\n", *oxoff, *oxlen); > return 0; > - > -invalid: > - dout(" invalid layout\n"); > - *ono = 0; > - *oxoff = 0; > - *oxlen = 0; > - return -EINVAL; > } > EXPORT_SYMBOL(ceph_calc_file_object_mapping); > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html