Re: [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op

Alex Elder <elder@xxxxxxxx> · Tue, 25 Feb 2014 07:05:40 -0600

On 02/25/2014 06:52 AM, Ilya Dryomov wrote:
> On Mon, Feb 24, 2014 at 4:59 PM, Alex Elder <elder@xxxxxxxx> wrote:
>> On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
>>> This is primarily for rbd's benefit and is supposed to combat
>>> fragmentation:
>>>
>>> "... knowing that rbd images have a 4m size, librbd can pass a hint
>>> that will let the osd do the xfs allocation size ioctl on new files so
>>> that they are allocated in 1m or 4m chunks.  We've seen cases where
>>> users with rbd workloads have very high levels of fragmentation in xfs
>>> and this would mitigate that and probably have a pretty nice
>>> performance benefit."
>>>
>>> SETALLOCHINT is considered advisory, so our backwards compatibility
>>> mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
>>>
>>> Signed-off-by: Ilya Dryomov <ilya.dryomov@xxxxxxxxxxx>
>>
>> I have a few small comments for you to consider but this
>> looks good to me.
>>
>> Reviewed-by: Alex Elder <elder@xxxxxxxxxx>

. . .

>> The other thing is that the expected size is limited by
>> rbd_image_header->obj_order, which is a single byte.  I
>> think you should encode this the same way.  Even if the
>> hint were for more than RBD, this level of granularity
>> may be sufficient.
>>
>>> +                     u64 expected_write_size;
>>
>> Probably the same thing here, a byte might be enough
>> to encode this by using log2(expected_write_size).
>>
>>> +                     __u8 expected_size_probability;
> 
> I think in the interest of genericity expected_object_size should be an
> arbitrary, not limited to a power-of-2 value.  Now, given the current
> 90M object size limit 64 bits might seem a lot, but extent offset and
> length are 64 bit values and to be future-proof I followed that here.

I have no objection to the 64-bit size but I still think
a byte representing log2(size) is sufficient.  Power-of-2
granularity is most likely fine (and might even be what
such a hint gets converted to anyway) for file systems
or other backing store.  But again, you can do that with
a 64 bit value as well.

> expected_size_probability is currently unused.  It was supposed to be
> a 0-100 value, indicating whether we expect the object to be smaller
> than expected_object_size or not and how strong that expectation is.
> 
> I'm not sure if you've seen it, but this (both userspace and kernel
> sides) were implemented according to the design laid out by Sage in the
> "rados io hints" thread on ceph-devel about a month ago.  There wasn't
> any discussion that I'm aware of in the interim, that is until the
> recent "wip-hint" thread, which was triggered by my PR.

I'm aware of it---I saw the discussion go by and skimmed
through it but did not look at it very closely.  I can't
keep up with all the traffic but I do try to pay attention
to code for review.

>> I'm not sure why these single-byte values use __u8 while the
>> multi-byte values use (e.g.) u32.  The leading underscores
>> are supposed to be used for values exposed to user space (or
>> something like that).  Anyway, not a problem with your change,
>> but something maybe you could check into.
> 
> I'm not sure either.  I vaguely assumed it's related to the fact that
> single-byte ceph_osd_req_op fields are directly assigned to the
> corresponding ceph_osd_op fields which are exported to userspace,
> whereas the multi-byte values go through the endianness conversion
> macros, which change the type to __le*.

Not a big deal regardless.

>>> +             } hint;
>>>       };
>>>  };

. . .

>>> +     /*
>>> +      * CEPH_OSD_OP_SETALLOCHINT op is advisory and therefore deemed
>>> +      * not worth a feature bit.  Set FAILOK per-op flag to make
>>> +      * sure older osds don't trip over an unsupported opcode.
>>> +      */
>>> +     op->flags |= CEPH_OSD_OP_FLAG_FAILOK;
>>
>> I think this is reasonable.  The only thing I wonder about is
>> whether there could be any other failure that could occur on
>> an OSD server that actually supports the op that we would care
>> about.  It's still advisory though, so functionally it won't
>> matter.
> 
> Currently there isn't any such failure.  In fact, the only thing that
> this FAILOK hides is the EOPNOTSUPP from an OSD that doesn't support
> the new op.  Everything else is done on the backend, and there all
> errors are explicitly ignored because this is an advisory op and it
> would bring down the OSD otherwise.

Sounds good.  Thanks for the explanations.

					-Alex

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html