Re: [PATCH 02/32] libceph: eliminate overflows in ceph_calc_file_object_mapping()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Mar 16, 2018 at 8:06 PM, Alex Elder <elder@xxxxxxxx> wrote:
> On 03/16/2018 07:37 AM, Alex Elder wrote:
>> From: Ilya Dryomov <idryomov@xxxxxxxxx>
>>
>> bl, stripeno and objsetno should be u64 -- otherwise large enough files
>> get corrupted.  How large depends on file layout:
>>
>> - 4M-objects layout (default): any file over 16P
>> - 64K-objects layout (smallest possible object size): any file over 512T
>>
>> Only CephFS is affected, rbd doesn't use ceph_calc_file_object_mapping()
>> yet.  Fortunately, CephFS has a max_file_size configurable, the default
>> for which is way below both of the above numbers.
>>
>> Reimplement the logic from scratch with no layout validation -- it's
>> done on the MDS side.
>>
>> Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
>
> OK I have some comments on this.  What you've produced is much better than
> what was there before--you've really simplified it a lot.  And it looks
> correct to me, so I'll mark it:
>
> Reviewed-by: Alex Elder <elder@xxxxxxxxxx>
>
> But please consider my comments and see if you can improve it a little
> further.  I think this function is a great place to explain a lot of
> detail about striping.  It ends up beiong very simple, which is excellent.
>
>> ---
>>  net/ceph/osdmap.c | 88 ++++++++++++++++++-------------------------------------
>>  1 file changed, 28 insertions(+), 60 deletions(-)
>>
>> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
>> index 0da27c66349a..6c1cd4e2f7a7 100644
>> --- a/net/ceph/osdmap.c
>> +++ b/net/ceph/osdmap.c
>> @@ -2141,72 +2141,40 @@ bool ceph_osds_changed(const struct ceph_osds *old_acting,
>>  }
>>
>>  /*
>> - * calculate file layout from given offset, length.
>> - * fill in correct oid, logical length, and object extent
>> - * offset, length.
>> + * Map a file extent to a stripe unit within an object.
>> + * Fill in objno, offset into object, and object extent length (i.e. the
>> + * number of bytes mapped, less than or equal to @l->stripe_unit).
>>   *
>> - * for now, we write only a single su, until we can
>> - * pass a stride back to the caller.
>> + * Example for stripe_count = 3, stripes_per_object = 4:
>> + *
>> + * blockno   |  0  3  6  9 |  1  4  7 10 |  2  5  8 11 | 12 15 18 21 | 13 16 19
>
> The above represents the mapping of data in the original file across objects.
> The numbers are offsets of "stripe unit" size.  They do not correlate to the
> "blockno" variable defined below.  So the name here should probably change,
> and it may warrant a little more explanation.
>
>> + * stripeno  |  0  1  2  3 |  0  1  2  3 |  0  1  2  3 |  4  5  6  7 |  4  5  6
>> + * stripepos |      0      |      1      |      2      |      0      |      1
>> + * objno     |      0      |      1      |      2      |      3      |      4
>> + * objsetno  |                    0                    |                    1
>>   */
>> -int ceph_calc_file_object_mapping(struct ceph_file_layout *layout,
>> +int ceph_calc_file_object_mapping(struct ceph_file_layout *l,
>>                                  u64 off, u64 len,
>> -                                u64 *ono,
>> -                                u64 *oxoff, u64 *oxlen)
>> +                                u64 *objno, u64 *objoff, u64 *xlen)
>>  {
>> -     u32 osize = layout->object_size;
>> -     u32 su = layout->stripe_unit;
>> -     u32 sc = layout->stripe_count;
>> -     u32 bl, stripeno, stripepos, objsetno;
>> -     u32 su_per_object;
>> -     u64 t, su_offset;
>> +     u32 stripes_per_object = l->object_size / l->stripe_unit;
>> +     u64 blockno;    /* which su */
>
> Maybe /* which su within file */  (or maybe "file" won't be the proper
> general term any more?).

Updated to say "which su in the file (i.e. globally)".

>
> Wouldn't "suno" (or just "su", or maybe "which_su" with no comment) be
> a better name?

Yes, "suno" is probably better -- block here is just a synonym for
stripe unit.  "blockno" came from the reference implementation in
src/osdc/Striper.cc.

block is also what "bl" in the old code stood for, I believe.

>
>> +     u32 blockoff;   /* offset into su */
>
> You could maybe add a "block_off" line in your helpful picture above.
> blockoff |  0  1  2  3  |  0  1  2  3 |  0  1  2  3 |  0  1  2  3 |  0  1  2

That's not correct.  blockoff is computed as file offset % su, so it is
an offset, not some kind of index.

>
> Somehow it would be nice to have consistency for these remainders.
> Like maybe "blockpos" rather than "blockoff" to match the rest.  Or
> maybe the others should change, or maybe neither "off" nor "pos" is
> the right suffix.  In any case, what you're doing has such nice
> uniformity below, e.g.:
>     num = ldiv_rem(val, unit_size, &remainder);
> it would be nice if all the names followed consistent patterns.

Let's take stripepos as an example.  You can think of it as an index
(position) into an array of stripe units within a stripe.

blockoff OTOH is a byte _offset_ into a stripe unit.  I guess one could
argue the case of treating a stripe unit as an array of bytes, but that
would make it a hell of a lot more confusing.

>
> If I had a more specific suggestion I'd provide it; hopefully my
> meaning is clear, and you can consider it.
>
>> +     u64 stripeno;   /* which stripe */
>> +     u32 stripepos;  /* which su in the stripe,
>> +                        which object in the object set */
>
> So a "stripe unit" is a fixed number of "blocks".  A "stripe" is composed

That number is always 1 -- block here is just a synonym for stripe
unit.

> of one "stripe unit" from each of "stripe_count" objects (and those objects
> together form an "object set").  An object holds (up to) "stripes_per_object"
> stripe units.  That number corresponds also to the number of stripes that
> span an object set before moving on to the next object set for subsequent
> data.  Is that all correct?

Yes.

>
> I don't know where the terms are formally defined but this function would
> be as good a place as any in the code to define them.  This stuff can be
> really confusing, so using well-defined terms is important.  (I've seen
> different naming schemes for striping before, and it leads to much confusion.)

"object set" is defined at http://docs.ceph.com/docs/master/dev/file-striping/.
The rest is just struct ceph_file_layout fields.

>
>> +     u64 objsetno;   /* which object set */
>> +     u32 objsetpos;  /* which stripe in the object set */
>> +
>> +     blockno = div_u64_rem(off, l->stripe_unit, &blockoff);> +       stripeno = div_u64_rem(blockno, l->stripe_count, &stripepos);
>> +     objsetno = div_u64_rem(stripeno, stripes_per_object, &objsetpos);
>> +
>> +     *objno = objsetno * l->stripe_count + stripepos;
>> +     *objoff = objsetpos * l->stripe_unit + blockoff;
>
> A one-line comment explaining the truncation below wouldn't hurt.  Perhaps
> just at top of the function.
>
>                                         -Alex
>
>> +     *xlen = min_t(u64, len, l->stripe_unit - blockoff);

It is sort of already there:

  + * Fill in objno, offset into object, and object extent length (i.e. the
  + * number of bytes mapped, less than or equal to @l->stripe_unit).

l->stripe_unit (and consequently blockoff) are both u32.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux