On 15/06/2015 17:10, Robert LeBlanc wrote:
John, let me see if I understand what you are saying... When a person runs `rbd top`, each OSD would receive a message saying: please capture all the performance data, grouped by RBD, and limit it to 'X'. That way the OSD doesn't have to constantly update performance for each object, but when it is requested it starts tracking it?
Right. Initially the OSD isn't collecting anything; it starts as soon as it sees a query get loaded (published via the OSDMap or some other mechanism).
That said, in practice I can see people having some set of queries that they always have loaded and feeding into graphite in the background.
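To make the "collect nothing until a query is loaded" idea concrete, here is a rough Python sketch. It is illustration only, not Ceph code: `OpStatsCollector`, `load_query`, `on_op` and the op dictionary fields are all names I've made up, and how the query actually reaches the OSD is left out.

```python
from collections import defaultdict


class OpStatsCollector:
    """Hypothetical per-OSD collector: idle until a query is loaded."""

    def __init__(self):
        self.queries = {}   # query name -> (filter_fn, group_key_fn)
        self.counters = {}  # query name -> {group key -> op count}

    def load_query(self, name, filter_fn, group_key_fn):
        # Would be called when a newly published query is seen.
        self.queries[name] = (filter_fn, group_key_fn)
        self.counters[name] = defaultdict(int)

    def unload_query(self, name):
        # Dropping the query also drops its in-memory counters.
        self.queries.pop(name, None)
        self.counters.pop(name, None)

    def on_op(self, op):
        # Fast path: with no queries loaded, an op only pays for this check.
        if not self.queries:
            return
        for name, (filter_fn, group_key_fn) in self.queries.items():
            if filter_fn(op):
                self.counters[name][group_key_fn(op)] += 1


collector = OpStatsCollector()
collector.on_op({"pool": 2, "image": "vm-disk-1"})  # ignored: nothing loaded yet

collector.load_query(
    "rbd_top",
    filter_fn=lambda op: op["pool"] == 2,      # cheap comparison on pool ID
    group_key_fn=lambda op: op["image"],       # group by (made-up) image name
)
collector.on_op({"pool": 2, "image": "vm-disk-1"})
collector.on_op({"pool": 2, "image": "vm-disk-2"})
collector.on_op({"pool": 3, "image": "other"})  # filtered out by pool ID

rbd_top_counts = dict(collector.counters["rbd_top"])
print(rbd_top_counts)
```

A set of "always loaded" queries feeding graphite would just be a few `load_query` calls made at startup in this model.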
If so, that is an interesting idea. I wonder if that would be simpler than tracking the performance of each/MRU objects in some format like /proc/diskstats where it is in memory and not necessarily consistent. The benefit is that you could have "lifelong" stats that show up like iostat and it would be a simple operation.
Hmm, not sure we're on the same page about this part. What I'm talking about is all in memory and would be lost across daemon restarts. Some other component would be responsible for gathering the stats from all the daemons in one place (that central part could persist stats if desired).
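As a rough sketch of that central gathering step (Python, assumptions only: `merge_daemon_stats` and the report layout are invented here, and the numbers are made-up example data), the component would sum the per-daemon in-memory counters and keep the top 'X' groups:

```python
from collections import Counter


def merge_daemon_stats(per_daemon_stats, limit):
    """Sum per-daemon {group: count} reports and return the top `limit` groups."""
    total = Counter()
    for stats in per_daemon_stats:
        total.update(stats)  # Counter.update adds counts rather than replacing
    return total.most_common(limit)


# One dictionary per daemon, as each might report for the same query.
osd_reports = [
    {"vm-disk-1": 120, "vm-disk-2": 30},
    {"vm-disk-1": 80, "vm-disk-3": 250},
    {"vm-disk-2": 15},
]

top = merge_daemon_stats(osd_reports, limit=2)
print(top)  # [('vm-disk-3', 250), ('vm-disk-1', 200)]
```

If a daemon restarts, its report simply starts again from zero; any persistence would live in whatever calls `merge_daemon_stats`.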
Each object should be able to reference back to RBD/CephFS upon request and the client could even be responsible for that load. Client performance data would need stats in addition to the object stats.
You could extend the mechanism to clients. However, as much as possible it's a good thing to keep it server side, as servers are generally fewer (we still have to reduce these stats across N servers to present them to the user), and we have multiple client implementations (kernel/userspace). What kind of thing do you want to get from clients?
My concern is that adding additional SQL-like logic to each op is going to get very expensive. I guess if we could push that to another thread early in the op, then it might not be too bad. I'm enjoying the discussion and new ideas.
Hopefully in most cases the query can be applied very cheaply, for operations like comparing pool ID or grouping by client ID. However, I would also envisage an optional sampling number, such that e.g. only 1 in every 100 ops would go through the query processing. Useful for systems where keeping highest throughput is paramount, and the numbers will still be useful if clients are doing many thousands of ops per second.
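A minimal sketch of the sampling idea, again in Python with invented names (`SampledQuery`, `sample_every`, the `pool` and `client` fields). It takes every Nth op, which is just one possible way to sample, applies a cheap pool ID filter, groups by client ID, and scales the counts back up when reporting:

```python
from collections import defaultdict


class SampledQuery:
    """Hypothetical query that only processes 1 in every `sample_every` ops."""

    def __init__(self, pool_id, sample_every=1):
        self.pool_id = pool_id
        self.sample_every = sample_every
        self._seen = 0
        self.sampled = defaultdict(int)  # client ID -> sampled op count

    def on_op(self, op):
        self._seen += 1
        # Skip everything except every Nth op; skipped ops cost one increment.
        if self._seen % self.sample_every != 0:
            return
        if op["pool"] != self.pool_id:   # cheap filter on pool ID
            return
        self.sampled[op["client"]] += 1  # group by client ID

    def estimates(self):
        # Scale sampled counts back up to approximate the true op counts.
        return {c: n * self.sample_every for c, n in self.sampled.items()}


query = SampledQuery(pool_id=2, sample_every=100)
for _ in range(10000):
    query.on_op({"pool": 2, "client": "client.a"})
for _ in range(5000):
    query.on_op({"pool": 2, "client": "client.b"})

estimated = query.estimates()
print(estimated)  # {'client.a': 10000, 'client.b': 5000}
```

Strict every-Nth sampling can line up with periodic client behaviour and skew the result; picking ops at random with probability 1/N would avoid that at the cost of a random number per op.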
Cheers,
John