On 02/25/2014 06:52 AM, Ilya Dryomov wrote: > On Mon, Feb 24, 2014 at 4:59 PM, Alex Elder <elder@xxxxxxxx> wrote: >> On 02/21/2014 12:55 PM, Ilya Dryomov wrote: >>> This is primarily for rbd's benefit and is supposed to combat >>> fragmentation: >>> >>> "... knowing that rbd images have a 4m size, librbd can pass a hint >>> that will let the osd do the xfs allocation size ioctl on new files so >>> that they are allocated in 1m or 4m chunks. We've seen cases where >>> users with rbd workloads have very high levels of fragmentation in xfs >>> and this would mitigate that and probably have a pretty nice >>> performance benefit." >>> >>> SETALLOCHINT is considered advisory, so our backwards compatibility >>> mechanism here is to set FAILOK flag for all SETALLOCHINT ops. >>> >>> Signed-off-by: Ilya Dryomov <ilya.dryomov@xxxxxxxxxxx> >> >> I have a few small comments for you to consider but this >> looks good to me. >> >> Reviewed-by: Alex Elder <elder@xxxxxxxxxx> . . . >> The other thing is that the expected size is limited by >> rbd_image_header->obj_order, which is a single byte. I >> think you should encode this the same way. Even if the >> hint were for more than RBD, this level of granularity >> may be sufficient. >> >>> + u64 expected_write_size; >> >> Probably the same thing here, a byte might be enough >> to encode this by using log2(expected_write_size). >> >>> + __u8 expected_size_probability; > > I think in the interest of genericity expected_object_size should be an > arbitrary, not limited to a power-of-2 value. Now, given the current > 90M object size limit 64 bits might seem a lot, but extent offset and > length are 64 bit values and to be future-proof I followed that here. I have no objection to the 64-bit size but I still think a byte representing log2(size) is sufficient. Power-of-2 granularity is most likely fine (and might even be what such a hint gets converted to anyway) for file systems or other backing store. But again, you can do that with a 64 bit value as well. > expected_size_probability is currently unused. It was supposed to be > a 0-100 value, indicating whether we expect the object to be smaller > than expected_object_size or not and how strong that expectation is. > > I'm not sure if you've seen it, but this (both userspace and kernel > sides) were implemented according to the design laid out by Sage in the > "rados io hints" thread on ceph-devel about a month ago. There wasn't > any discussion that I'm aware of in the interim, that is until the > recent "wip-hint" thread, which was triggered by my PR. I'm aware of it---I saw the discussion go by and skimmed through it but did not look at it very closely. I can't keep up with all the traffic but I do try to pay attention to code for review. >> I'm not sure why these single-byte values use __u8 while the >> multi-byte values use (e.g.) u32. The leading underscores >> are supposed to be used for values exposed to user space (or >> something like that). Anyway, not a problem with your change, >> but something maybe you could check into. > > I'm not sure either. I vaguely assumed it's related to the fact that > single-byte ceph_osd_req_op fields are directly assigned to the > corresponding ceph_osd_op fields which are exported to userspace, > whereas the multi-byte values go through the endianness conversion > macros, which change the type to __le*. Not a big deal regardless. >>> + } hint; >>> }; >>> }; . . . >>> + /* >>> + * CEPH_OSD_OP_SETALLOCHINT op is advisory and therefore deemed >>> + * not worth a feature bit. Set FAILOK per-op flag to make >>> + * sure older osds don't trip over an unsupported opcode. >>> + */ >>> + op->flags |= CEPH_OSD_OP_FLAG_FAILOK; >> >> I think this is reasonable. The only thing I wonder about is >> whether there could be any other failure that could occur on >> an OSD server that actually supports the op that we would care >> about. It's still advisory though, so functionally it won't >> matter. > > Currently there isn't any such failure. In fact, the only thing that > this FAILOK hides is the EOPNOTSUPP from an OSD that doesn't support > the new op. Everything else is done on the backend, and there all > errors are explicitly ignored because this is an advisory op and it > would bring down the OSD otherwise. Sounds good. Thanks for the explanations. -Alex -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html