Hi Sage,

> On Dec 19, 2017, at 11:13 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> Hi Eric,
>
> On Mon, 18 Dec 2017, J. Eric Ivancich wrote:
>> == Units ==
>>
>> One key difference we're considering, however, is changing the unit
>> that reservations and limits are expressed in from ops/sec to
>> something more appropriate for Ceph. Operations have payloads of
>> different sizes and will therefore take different amounts of time, and
>> that should be factored in. We might refer to this as the "cost" of
>> the operation. And the cost is not linear with the size of the
>> payload. For example, a write of 4 MB might only take 20 times as long
>> as a write of 4 KB even though the sizes differ by a factor of
>> 1000. Using cost would allow us to, for example, achieve a fairer
>> prioritization of a client doing many small writes against a client
>> that's doing a few larger writes.
>>
>> One proposed formula to translate one op into cost would be something
>> along the lines of:
>>
>>     cost_units = a + b * log(payload_size)
>>
>> where a and b would have to be chosen or tuned based on the storage
>> back-end.
>>
>> And that gets us to the units for defining reservation and limit --
>> cost_units per unit of time. Typically these are floating point
>> values, however we do not use floating point types in librados calls
>> because qemu, when calling into librbd, does not save and restore the
>> cpu's floating point mode.
>>
>> There are two ways of getting appropriate ranges of values given that
>> we need to use integral types for cost_units per unit of time. One is
>> a large time unit in the denominator, such as minutes or even
>> hours. That would leave us with cost_units per minute. We are unsure
>> that the strange unit is the best approach and your feedback would be
>> appreciated.
>>
>> A standard alternative would be to use a standard time unit, such as
>> seconds, but integers as fixed-point values.
>> So a floating-point value in cost_units per second would be
>> multiplied by, say, 1000 and rounded to get the corresponding
>> integer value.
>
> I think if payload_size above is bytes, then any reasonable value for
> cost_units will be a non-tiny integer, and we won't need floating point,
> right? E.g., a 4KB write would be (at a minimum) 10, but probably larger
> if a and b >= 1. That would let us keep seconds as the time base?

Very good point! As long as the cost of an op is greater than 10 (maybe
even much greater than 10), a reservation or limit as low as 1 cost_unit
per second would already be a small setting. We can then avoid both odd
time-unit denominators and fixed-point values and still express low
settings.

>> == librados Additions ==
>>
>> The basic idea is that one would be able to create (and destroy) qos
>> profiles and then associate a profile with an ioctx. Ops on the ioctx
>> would use the qos profile associated with it.
>>
>>     typedef void* rados_qos_profile_t; // opaque
>>
>>     // parameters uint64_t in cost_units per time unit as discussed above
>>     profile1 = rados_qos_profile_create(reservation, weight, limit);
>>
>>     rados_ioctx_set_qos_profile(ioctx3, profile1);
>>
>>     ...
>>     // ops to ioctx3 would now use the specified profile
>>     ...
>>
>>     // use the profile just for a particular operation
>>     rados_write_op_set_qos_profile(op1, profile1);
>>
>>     rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
>>
>>     rados_qos_profile_destroy(profile1);
>
> I would s/destroy/release/, as the profile will be implicitly
> reference counted (with a ref consumed by the ioctx that is pointing to
> it).
>
> It might be useful to add a rados_qos_profile_get_id(handle) that returns
> the client-local integer id that we're using to identify the profile.
> This isn't really useful for the application per se, but it will be
> helpful for debugging purposes perhaps?

Both sound good.
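
To make the release-versus-destroy point concrete, here is a minimal
sketch of how a reference-counted profile handle might behave. It is not
the librados implementation: every sketch_* name, the struct layouts, and
the way ids are assigned are assumptions made up for illustration. Only
the overall shape (create, attach to an ioctx, release, get_id) follows
the proposal above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the opaque rados_qos_profile_t handle. */
typedef struct sketch_qos_profile {
    uint64_t id;          /* client-local id, meant for debugging only */
    uint64_t reservation; /* cost_units per second */
    uint64_t weight;
    uint64_t limit;       /* cost_units per second */
    unsigned refs;        /* holders: the application plus any ioctx */
} sketch_qos_profile;

/* Hypothetical stand-in for an ioctx; a NULL profile means "default". */
typedef struct sketch_ioctx {
    sketch_qos_profile *profile;
} sketch_ioctx;

static uint64_t sketch_next_id = 1; /* assumed id scheme: simple counter */

/* Create a profile; the caller holds the first reference. */
sketch_qos_profile *sketch_profile_create(uint64_t reservation,
                                          uint64_t weight,
                                          uint64_t limit)
{
    sketch_qos_profile *p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    p->id = sketch_next_id++;
    p->reservation = reservation;
    p->weight = weight;
    p->limit = limit;
    p->refs = 1;
    return p;
}

/* Drop one reference and free the profile once nobody holds it.
 * Returns the number of references that remain. */
unsigned sketch_profile_release(sketch_qos_profile *p)
{
    assert(p != NULL && p->refs > 0);
    unsigned remaining = --p->refs;
    if (remaining == 0)
        free(p);
    return remaining;
}

/* Attach a profile to the ioctx (NULL selects the default profile).
 * The ioctx takes its own reference and drops the one it held before.
 * Taking the new reference first keeps re-attaching the same profile safe. */
void sketch_ioctx_set_profile(sketch_ioctx *io, sketch_qos_profile *p)
{
    assert(io != NULL);
    if (p != NULL)
        p->refs++;
    if (io->profile != NULL)
        sketch_profile_release(io->profile);
    io->profile = p;
}

/* Return the client-local id of a profile. */
uint64_t sketch_profile_get_id(const sketch_qos_profile *p)
{
    assert(p != NULL);
    return p->id;
}
```

With this shape, an application can release its own handle right after
attaching the profile to an ioctx. The profile stays alive until the
ioctx is pointed at another profile or back at the default, which is why
"release" describes the call better than "destroy".
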
>> == MOSDOp and MOSDOpReply Changes ==
>>
>> Because the qos_profile would be managed by the ioctx, MOSDOps sent
>> via that ioctx would include the reservation, weight, and limit. At
>> this point we think this would be better than keeping the profiles on
>> the back-end, although it increases the MOSDOp data structure by about
>> 128 bits.
>>
>> The MOSDOp type already contains dmclock's delta and rho parameters
>> and MOSDOpReply already contains the dmclock phase indicator due to
>> prior work. Given that we're moving towards using cost_unit per
>> time_unit rather than ops per sec, perhaps we should also include the
>> calculated cost in the MOSDOpReply.
>
> Good idea!

Thank you,

Eric
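
For reference, the cost formula from the Units discussion can be
evaluated without any floating point, which matters given the qemu
constraint mentioned there. The sketch below is only an illustration
under stated assumptions: the thread does not fix the base of the
logarithm, so base 2 with the result rounded down is assumed here, a
zero-length payload is treated as one byte, and the function names are
hypothetical. The values of a and b are placeholders, not tuned numbers.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(n)) for n >= 1, using integer arithmetic only. */
static unsigned sketch_floor_log2(uint64_t n)
{
    assert(n >= 1);
    unsigned bits = 0;
    while ((n >>= 1) != 0)
        bits++;
    return bits;
}

/* cost_units = a + b * log(payload_size), with the log taken as
 * floor(log2(...)) (an assumption; the thread leaves the base open).
 * a and b are back-end tunables. A zero-length payload is treated
 * as one byte so that its cost is simply a. */
uint64_t sketch_op_cost(uint64_t a, uint64_t b, uint64_t payload_bytes)
{
    if (payload_bytes == 0)
        payload_bytes = 1;
    return a + b * sketch_floor_log2(payload_bytes);
}
```

With the placeholder values a = 0 and b = 1, a 4 KB write costs 12 and a
4 MB write costs 22. Both are comfortably above 10, so a reservation or
limit of 1 cost_unit per second remains a low setting with seconds as
the time base, and the same integer could be carried in a reply.
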