On Mon, Jun 15, 2015 at 8:03 AM, John Spray <john.spray@xxxxxxxxxx> wrote:
>
> On 15/06/2015 14:52, Sage Weil wrote:
>>
>> I seem to remember having a short conversation about something like this
>> a few CDS's back... although I think it was 'rados top'. IIRC the basic
>> idea we had was for each OSD to track its top clients (using some
>> approximate LRU type algorithm) and then either feed this relatively
>> small amount of info (say, top 10-100 clients) back to the mon for
>> summation, or dump it via the admin socket for calamari to aggregate.
>>
>> This doesn't give you the rbd image name, but I bet we could infer that
>> without too much trouble (e.g., include a recent object or two with the
>> client). Or, just assume that the client id is enough (it'll include an
>> IP and PID... enough info to find the /var/run/ceph admin socket or the
>> VM process).
>>
>> If we were going to do top clients, I think it'd make sense to have a
>> top objects list as well, so you can see what the hottest objects in the
>> cluster are.
>
> The following is a bit of a tangent...
>
> A few weeks ago I was thinking about general solutions to this problem
> (for the filesystem). I played (very briefly, on wip-live-query) with the
> idea of publishing a list of queries to the MDSs/OSDs, which would allow
> runtime configuration of what kind of thing we're interested in and how
> we want it broken down.
>
> If we think of it as an SQL-like syntax, then for the RBD case we would
> have something like:
>
>   SELECT read_bytes, write_bytes WHERE pool=rbd GROUP BY rbd_image
>
> (You'd need a protocol-specific module of some kind to define what
> "rbd_image" means here, doing a simple mapping from object attributes to
> an identifier; similar modules would exist for e.g. the cephfs inode.)
>
> Each time an OSD does an operation, it consults the list of active
> "performance queries" and updates counters according to the value of the
> GROUP BY parameter for the query (so in the above example each OSD would
> keep a result row for each rbd image touched).
>
> The LRU part could be implemented as LIMIT + SORT BY parameters, such
> that the result rows would be periodically sorted and the least-touched
> results would drop off the list. That would probably be used in
> conjunction with a decay operator on the sorted-by field, like:
>
>   SELECT read_bytes, write_bytes, ops WHERE pool=rbd GROUP BY rbd_image
>   SORT BY movingAverage(derivative(ops)) LIMIT 100
>
> Combining WHERE clauses would let the user "drill down" (apologies for
> the buzzword): identify the busiest clients, then for each of those
> clients identify which images/files/objects it is most active on, or
> vice versa, identify busy objects and then see which clients are hitting
> them. Keeping around enough stats to enable this is usually prohibitive
> at scale, but it's fine when you're actively creating custom queries for
> the results you're really interested in, instead of keeping
> N_clients*N_objects stats, and when you have the LIMIT part to ensure
> results never get oversized.
>
> The GROUP BY options would also include metadata sent from clients, e.g.
> the obvious cases like VM instance names, rack IDs, or HPC job IDs.
> Maybe also some less obvious ones, like decorating cephfs IOs with the
> inode of the directory containing the file, so that OSDs could
> accumulate per-directory bandwidth numbers and users could ask "which
> directory is bandwidth-hottest?" as well as "which file is
> bandwidth-hottest?".
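
(To make the query idea concrete, here's a rough Python sketch of what a
single OSD-side "performance query" accumulator could look like. This is
purely illustrative: the PerfQuery name, the op-dict shape, the simple
exponential decay standing in for movingAverage(derivative(ops)), and the
trim policy are all assumptions, not anything that exists in Ceph today.)

# Illustrative only: a toy model of an OSD-side "performance query".
# None of this is real Ceph code; names and structure are made up.

import time
from collections import defaultdict


class PerfQuery:
    """One active query: accumulates a result row per GROUP BY key and
    keeps a decayed 'hotness' score for SORT BY ... LIMIT trimming."""

    def __init__(self, where, group_by, counters, limit=100, half_life=60.0):
        self.where = where          # predicate over an op, e.g. pool == 'rbd'
        self.group_by = group_by    # op -> grouping key, e.g. rbd image name
        self.counters = counters    # fields to sum, e.g. ('read_bytes', 'ops')
        self.limit = limit          # max result rows to retain
        self.half_life = half_life  # seconds for the hotness score to halve
        self.rows = defaultdict(self._new_row)

    def _new_row(self):
        row = {c: 0 for c in self.counters}
        row.update(score=0.0, stamp=time.time())
        return row

    def note_op(self, op):
        # In reality this would be a cheap hook in the OSD op path.
        if not self.where(op):
            return
        row = self.rows[self.group_by(op)]
        now = time.time()
        # Exponentially decay the hotness score, then bump it for this op,
        # so recently-busy keys sort first when we trim.
        row['score'] = row['score'] * 0.5 ** ((now - row['stamp']) / self.half_life) + 1.0
        row['stamp'] = now
        for c in self.counters:
            row[c] += op.get(c, 0)

    def trim(self):
        # The LIMIT part: periodically drop the least-hot rows.
        if len(self.rows) > self.limit:
            hottest = sorted(self.rows.items(),
                             key=lambda kv: kv[1]['score'],
                             reverse=True)[:self.limit]
            self.rows = defaultdict(self._new_row, hottest)

    def report(self):
        # What would be fed back to the mon or dumped via the admin socket.
        return {k: {c: v[c] for c in self.counters}
                for k, v in self.rows.items()}


# SELECT read_bytes, write_bytes, ops WHERE pool=rbd GROUP BY rbd_image
q = PerfQuery(where=lambda op: op['pool'] == 'rbd',
              group_by=lambda op: op['rbd_image'],
              counters=('read_bytes', 'write_bytes', 'ops'))
q.note_op({'pool': 'rbd', 'rbd_image': 'vm-0042', 'read_bytes': 4096, 'ops': 1})
q.trim()
print(q.report())

The real thing would presumably be a cheap hook in the OSD op path in C++,
with rows periodically summed at the mon or dumped via the admin socket, but
the shape of the per-query state would be roughly this.
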
> Then, after implementing all that craziness, you get some kind of wild
> multicolored GUI that shows you where the action is in your system at a
> cephfs/rgw/rbd level.

I *like* that idea. We should discuss with Sam before doing too much,
though, as I know he's thought about various online computations in RADOS
before. Something like this is also interesting in comparison to our
long-theorized "PG classes", etc.

-Greg