Re: RFC: progress bars

On 28/05/2015 17:41, Robert LeBlanc wrote:
> Let me see if I understand this... Your idea is to have a progress bar
> that shows (active+clean + active+scrub + active+deep-scrub) / pgs and
> then estimate the time remaining?

Not quite: it's not about doing a calculation on the global PG state counts. The code identifies specific PGs affected by specific operations, and then watches the status of those PGs.
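
To make that concrete, here's a minimal sketch of the shape of it (made-up names, not the actual code): when an event is created we record exactly which PGs it touched, and on each refresh we report how many of those PGs have reached a clean state.

    class PgProgressEvent(object):
        def __init__(self, description, pg_ids):
            self.description = description
            # The specific PGs affected by this operation, captured at event start
            self.pg_ids = set(pg_ids)

        def progress(self, pg_states):
            """pg_states: dict of pg_id -> state string, e.g. 'active+clean'"""
            if not self.pg_ids:
                return 1.0
            done = sum(1 for pgid in self.pg_ids
                       if 'clean' in pg_states.get(pgid, ''))
            return float(done) / len(self.pg_ids)

    # e.g. an event created when osd.3 is marked out, tracking only the PGs
    # that were mapped to osd.3 at that moment:
    event = PgProgressEvent("Rebalancing after osd.3 marked out",
                            ["1.0", "1.5", "2.1"])
    print("%d%% complete" % (100 * event.progress({"1.0": "active+clean",
                                                   "1.5": "active+recovering",
                                                   "2.1": "active+clean"})))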


> So if PGs are split, the numbers change and the progress bar goes
> backwards; is that a big deal?

I don't see a case where the progress bars go backwards with the code I have so far. For operations on PGs that split, it'll just ignore the new PGs, but you'll get a separate event tracking the creation of the new ones. In general, progress bars going backwards isn't something we should allow to happen (happy to hear counter-examples though; I'm mainly speaking from intuition on that point!).

If this were extended to track operations across PG splits (it's unclear to me that the complexity is worthwhile), the bar still wouldn't need to go backwards, because whatever stat was being tracked would remain the same when summed across the newly split PGs.
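
As a rough illustration (made-up numbers) of why a split needn't move the bar: if the bar were driven by a per-PG stat summed over the event's PGs, rather than by PG counts, a split just redistributes that stat across the children, so the fraction complete stays where it was.

    def progress_from_stats(remaining_per_pg, total_at_start):
        """Progress based on objects still to recover vs. the total at event start."""
        return 1.0 - float(sum(remaining_per_pg.values())) / total_at_start

    total_at_start = 1000

    before_split = {"1.0": 200}             # 200 objects left in pg 1.0
    after_split = {"1.0": 120, "1.8": 80}   # 1.0 has split; children share the 200

    assert progress_from_stats(before_split, total_at_start) == \
           progress_from_stats(after_split, total_at_start)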

> I don't think so; it might take a
> little time to recalculate how long it will take, but no big deal. I
> do like the idea of the progress bar even if it is fuzzy. I keep
> running ceph status or ceph -w to watch things and have to imagine it
> in my mind.

Right, the idea is to save the admin from having to interpret PG counts mentally.

> It might be nice to have some other stats like client I/O
> and rebuild I/O so that I can see if recovery is impacting production
> I/O.

We already have some of these stats globally, but it would be nice to be able to reason about what proportion of I/O is associated with specific operations, e.g. "I have a total recovery I/O figure; what proportion of that is due to a particular drive failure?". Without going and looking at the current pg stat structures, I don't know whether there is enough data in the mon right now to guess those numbers. In any case, this would *definitely* be heuristic rather than exact.
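
The sort of heuristic I have in mind, assuming purely for illustration that we had per-PG recovery rates to hand (which may well not be true of what the mon keeps today), would be something like:

    def recovery_share(event_pg_ids, recovery_rate_per_pg):
        """Estimate the fraction of total recovery I/O due to one event's PGs."""
        total = sum(recovery_rate_per_pg.values())
        if total == 0:
            return 0.0
        ours = sum(rate for pgid, rate in recovery_rate_per_pg.items()
                   if pgid in event_pg_ids)
        return ours / total

    # Made-up rates in bytes/sec for three recovering PGs
    rates = {"1.0": 40e6, "1.5": 10e6, "2.1": 50e6}
    print("Drive failure accounts for ~%d%% of recovery I/O"
          % (100 * recovery_share({"1.0", "1.5"}, rates)))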

Cheers,
John



