On Fri, 17 Jan 2014, Sage Weil wrote:
> I think we need to add rados operations that provide hints to rados about
> what the expected object sizes, alignment, and cacheability are. In
> particular, I see two main wins:
>
> - knowing that rbd images have a 4m size, librbd can pass a hint that
> will let the osd do the xfs allocation size ioctl on new files so that
> they are allocated in 1m or 4m chunks. We've seen cases where users with
> rbd workloads have very high levels of fragmentation in xfs and this would
> mitigate that and probably have a pretty nice performance benefit.
>
> - If the rbd (or other client) cache is enabled, we can pass a hint that
> indicates that the OSD shouldn't keep the object pages around in cache.
> This would just translate into an fadvise DONTNEED or similar.
>
> I think the challenge is to keep this as generic as possible from the
> client's perspective, but make sure that there is enough information to
> translate it into a good set of low-level hints to the underlying backend
> (like alignment size and fadvise). For example, intuitively the 1m
> allocation unit sounds about right to me, but rbd would probably
> communicate to rados that the objects are expected to be 4m each (or
> whatever the striping strategy is). I'm thinking the "we shouldn't do an
> allocation unit more than 1m" logic should live in the FileStore, tunable
> via a config option?

Offline feedback is, "yes we should do this," so let me propose some
specifics.

* For cache hints, let's just mirror what fadvise is doing. That makes it
easy to translate it down to the lower layers as is without being clever.
Also, I think it's a reasonably complete set of hints. We can add a
bitfield indicating whether it applies to data or omap or both, and we can
make the default both. (Not sure we can plumb it into leveldb, but maybe in
the future or with another backend.)

  CEPH_OSD_OP_FADVISE
    flags
      normal
      sequential
      random
      noreuse
      willneed
      dontneed
    what
      data
      k/v

* The flash people tell us that it would be very useful to know what the
expected temperature or lifetime of a block/key/object/whatever will be.
Maybe a 'shortlived' and 'longlived' flag added to the above?

* Allocation hint: I think the biggest immediate win we can get is poking
XFS's allocation unit to ~1M or similar for rbd images to avoid excessive
fragmentation. How about:

  CEPH_OSD_OP_ALLOCHINT
    expected_size        -- expected size of the object
    expected_write_size  -- what kinds of writes we expect to see from
                            the client

This is harder because these don't translate directly into what we want
(XFS to alloc in larger units), but I don't think it's a good idea to
expose that level of detail to, say, librbd. Instead, librbd should say
"the objects will max out at 4MB" and "my writes are usually this big".

Maybe we could add in an "expected sparseness", or
"expected_size_probability" so that we can indicate whether we expect
things to be less than 4MB or not? Then for cephfs, for example, we would
say the object will be 4MB but we *wouldn't* assume that they are all going
to be that big (because there are many small files).

Thoughts?
sage
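
For illustration only, here is a minimal sketch of how a backend might turn the proposed expected_size hint into an allocation unit, with the "no more than 1m" cap kept on the OSD side. Every name in it (alloc_unit_from_hint, MAX_ALLOC_UNIT, the 4096-byte block size) is hypothetical and not taken from Ceph. The sketch ignores expected_write_size, because the proposal leaves open how that value would feed into the decision.

```python
# Hypothetical sketch; none of these names exist in Ceph.
MIB = 1 << 20

# Assumed backend tunable: largest allocation unit we will ever request,
# standing in for the "no more than 1m" config option discussed above.
MAX_ALLOC_UNIT = 1 * MIB


def alloc_unit_from_hint(expected_size: int,
                         max_unit: int = MAX_ALLOC_UNIT,
                         block_size: int = 4096) -> int:
    """Translate a client's expected object size into an allocation unit.

    Returns 0 when there is nothing useful to ask for, meaning "leave the
    filesystem default alone".
    """
    if expected_size <= 0:
        return 0
    # The client only says how big objects get; the cap is backend policy.
    unit = min(expected_size, max_unit)
    # Round down to a whole number of (assumed) filesystem blocks.
    return (unit // block_size) * block_size


if __name__ == "__main__":
    # rbd-style hint: objects max out at 4 MiB, so the unit is capped at 1 MiB.
    print(alloc_unit_from_hint(4 * MIB))
```

The point of this shape is that the client states only what it knows (object size), while the choice of allocation unit stays a tunable in the backend.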