On Monday 01 Sep 2014 at 13:41:01 (+0200), Markus Armbruster wrote:
> Benoît Canet <benoit.canet@xxxxxxxxxxx> writes:
>
> > On Monday 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> >> Cc'ing libvirt following Stefan's lead.
> >>
> >> Benoît Canet <benoit.canet@xxxxxxxxxxx> writes:
> >>
> >> > Hi,
> >> >
> >> > I collected some items of a cloud provider wishlist regarding I/O accounting.
> >>
> >> Feedback from real power-users, lovely!
> >>
> >> > In a cloud, I/O accounting can have 3 purposes: billing, helping the customers,
> >> > and doing metrology to help the cloud provider find hidden costs.
> >> >
> >> > I'll cover the two former topics in this mail because they are the most
> >> > important business-wise.
> >> >
> >> > 1) preferred place to collect billing I/O accounting data:
> >> > -----------------------------------------------------------
> >> > For billing purposes the collected data must be as close as possible to
> >> > what the customer would see by using iostats in his VM.
> >>
> >> Good point.
> >>
> >> > The first conclusion we can draw is that the choice of collecting I/O
> >> > accounting data used for billing in the block device models is right.
> >>
> >> Slightly rephrasing: doing I/O accounting in the block device models is
> >> right for billing.
> >>
> >> There may be other uses for I/O accounting, with different preferences.
> >> For instance, data on how exactly guest I/O gets translated to host I/O
> >> as it flows through the nodes in the block graph could be useful.
> >
> > I think this is the third point that I named as metrology.
> > Basically it boils down to "Where are the hidden I/O costs of the QEMU
> > block layer?".
>
> Understood.
>
> >> Doesn't diminish the need for accurate billing information, of course.
> >>
> >> > 2) what to do with occurrences of rare events:
> >> > ----------------------------------------------
> >> >
> >> > Another point is that QEMU developers agree that they don't know which
> >> > policy to apply to some I/O accounting events.
> >> > Must QEMU discard invalid write I/Os or account them as done?
> >> > Must QEMU count a failed read I/O as done?
> >> >
> >> > When discussing this with a cloud provider the following appears: these
> >> > decisions are really specific to each cloud provider and QEMU should not
> >> > implement them.
> >>
> >> Good point, consistent with the old advice to avoid baking policy into
> >> inappropriately low levels of the stack.
> >>
> >> > The right thing to do is to add accounting counters to collect these events.
> >> >
> >> > Moreover, these rare events are precious troubleshooting data, so that's an
> >> > additional reason not to toss them.
> >>
> >> Another good point.
> >>
> >> > 3) list of block I/O accounting metrics wished for billing and helping the customers
> >> > -------------------------------------------------------------------------------------
> >> >
> >> > Basic I/O accounting data will end up making the customers' bills.
> >> > Extra I/O accounting information would be a precious help for the cloud
> >> > provider to implement a monitoring panel like Amazon CloudWatch.
> >>
> >> These are the first two from your list of three purposes, i.e. the ones
> >> you promised to cover here.
> >>
> >> > Here is the list of counters and statistics I would like to help implement
> >> > in QEMU.
> >> >
> >> > This is the most important part of the mail and the one I would like the
> >> > community to review the most.
> >> >
> >> > Once this list is settled I would proceed to implement the required
> >> > infrastructure in QEMU before using it in the device models.
> >>
> >> For context, let me recap how I/O accounting works now.
> >>
> >> The BlockDriverState abstract data type (short: BDS) can hold the
> >> following accounting data:
> >>
> >>     uint64_t nr_bytes[BDRV_MAX_IOTYPE];
> >>     uint64_t nr_ops[BDRV_MAX_IOTYPE];
> >>     uint64_t total_time_ns[BDRV_MAX_IOTYPE];
> >>     uint64_t wr_highest_sector;
> >>
> >> where BDRV_MAX_IOTYPE enumerates read, write, flush.
> >>
> >> wr_highest_sector is a high watermark updated by the block layer as it
> >> writes sectors.
> >>
> >> The other three are *not* touched by the block layer.  Instead, the
> >> block layer provides a pair of functions for device models to update
> >> them:
> >>
> >>     void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
> >>                          int64_t bytes, enum BlockAcctType type);
> >>     void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
> >>
> >> bdrv_acct_start() initializes the cookie for a read, write, or flush
> >> operation of a certain size.  The size of a flush is always zero.
> >>
> >> bdrv_acct_done() adds the operation to the BDS's accounting data.
> >> total_time_ns is incremented by the time between _start() and _done().
> >>
> >> You may call _start() without calling _done().  That's a feature.
> >> Device models use it to avoid accounting some requests.
> >>
> >> Device models are not supposed to mess with the cookie directly, only
> >> through these two functions.
> >>
> >> Some device models implement accounting, some don't.  The ones that do
> >> don't agree on how to count invalid guest requests (the ones not passed
> >> to the block layer) and failed requests (passed to the block layer and
> >> failed there).  It's a mess in part caused by us never writing down what
> >> exactly device models are expected to do.
> >>
> >> Accounting data is used by "query-blockstats", and nothing else.
> >>
> >> Corollary: even though every BDS holds accounting data, only the ones in
> >> "top" BDSes ever get used.  This is a common block layer blemish, and
> >> we're working on cleaning it up.
> >>
> >> If a device model doesn't implement accounting, query-blockstats lies.
> >> Fortunately, its lies are pretty transparent (everything's zero) as long
> >> as you don't do things like connecting a backend to a device model that
> >> doesn't implement accounting after disconnecting it from a device model
> >> that does.  Still, I'd welcome a more honest QMP interface.
> >>
> >> For me, this accounting data belongs to the device model, not the BDS.
> >> Naturally, the block device models should use common infrastructure.  I
> >> guess they use the block layer only because it's obvious infrastructure
> >> they share.  Clumsy design.
> >>
> >> > /* volume of data transferred by the I/Os */
> >> > read_bytes
> >> > write_bytes
> >>
> >> This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].
> >>
> >> nr_bytes[BDRV_ACCT_FLUSH] is always zero.
> >>
> >> Should this count only actual I/O, i.e. accumulated size of successful
> >> operations?
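
To make the current contract recapped above concrete, here is a rough sketch
of how a device model drives the two calls for a guest read.  The request
struct and completion callback are made up for illustration; only the
bdrv_acct_start()/bdrv_acct_done() pair and the BDRV_ACCT_* types are the
real interfaces described above, and bdrv_aio_readv() merely stands in for
however the device model actually submits the request to the block layer:

/* Illustrative only -- request struct and callback are hypothetical;
 * declarations come from the QEMU block layer headers. */

typedef struct MyReq {
    BlockDriverState *bs;
    BlockAcctCookie acct;       /* opaque cookie, managed by the block layer */
    QEMUIOVector qiov;
} MyReq;

static void my_read_cb(void *opaque, int ret)
{
    MyReq *req = opaque;

    if (ret == 0) {
        /* success: size and start->done latency get folded into the BDS stats */
        bdrv_acct_done(req->bs, &req->acct);
    }
    /* on failure the request is simply never _done()'d -- exactly the
     * unspecified policy this thread is trying to pin down */
}

static void my_submit_read(MyReq *req, int64_t sector_num, int nb_sectors)
{
    bdrv_acct_start(req->bs, &req->acct,
                    nb_sectors * BDRV_SECTOR_SIZE, BDRV_ACCT_READ);
    bdrv_aio_readv(req->bs, sector_num, &req->qiov, nb_sectors,
                   my_read_cb, req);
}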
> >> >
> >> > /* operation count */
> >> > read_ios
> >> > write_ios
> >> > flush_ios
> >> >
> >> > /* how many invalid I/Os the guest submitted */
> >> > invalid_read_ios
> >> > invalid_write_ios
> >> > invalid_flush_ios
> >> >
> >> > /* how many I/O errors happened */
> >> > read_ios_error
> >> > write_ios_error
> >> > flush_ios_error
> >>
> >> This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
> >> nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
> >>
> >> > /* account the time passed doing I/Os */
> >> > total_read_time
> >> > total_write_time
> >> > total_flush_time
> >>
> >> This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
> >> total_time_ns[BDRV_ACCT_FLUSH].
> >>
> >> I guess this should count both successful and failed I/O.  Could throw
> >> in invalid, too, but it's probably too quick to matter.
> >>
> >> Please specify the unit clearly.  Both total_FOO_time_ns or total_FOO_ns
> >> would work for me.
> >
> > Yes, _ns is fine for me too.
> >
> >> > /* since when the volume is iddle */
> >> > qvolume_iddleness_time
> >>
> >> "idle"
> >>
> >> The obvious way to maintain this information with the current code would
> >> be saving the value of get_clock() in bdrv_acct_done().
> >>
> >> > /* the following would compute latencies for slices of 1 second, then toss the
> >> >  * result and start a new slice.  A weighted summation of the instant latencies
> >> >  * could help to implement this.
> >> >  */
> >> > 1s_read_average_latency
> >> > 1s_write_average_latency
> >> > 1s_flush_average_latency
> >> >
> >> > /* the former three numbers could be used to further compute a 1 minute slice value */
> >> > 1m_read_average_latency
> >> > 1m_write_average_latency
> >> > 1m_flush_average_latency
> >> >
> >> > /* the former three numbers could be used to further compute a 1 hour slice value */
> >> > 1h_read_average_latency
> >> > 1h_write_average_latency
> >> > 1h_flush_average_latency
> >>
> >> This is something like "what we added to total_FOO_time in the last
> >> completed 1s / 1m / 1h time slice divided by the number of additions".
> >> Just another way to accumulate the same raw data, thus no worries.
> >>
> >> > /* 1 second average number of requests in flight */
> >> > 1s_read_queue_depth
> >> > 1s_write_queue_depth
> >> >
> >> > /* 1 minute average number of requests in flight */
> >> > 1m_read_queue_depth
> >> > 1m_write_queue_depth
> >> >
> >> > /* 1 hour average number of requests in flight */
> >> > 1h_read_queue_depth
> >> > 1h_write_queue_depth
> >>
> >> I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
> >> The "you need not call bdrv_acct_done()" feature may get in the way.
> >> Solvable.
> >>
> >> Permit me a short detour into the other use for I/O accounting I
> >> mentioned: data on how exactly guest I/O gets translated to host I/O as
> >> it flows through the nodes in the block graph.  Do you think this would
> >> be pretty much the same data, just collected at different points?
> >
> > That's something I would like to take care of in a further sub-project.
>
> I'm asking now because it impacts where I'd prefer the shared
> infrastructure to live.
>
> > Optionally collecting the same data for each BDS of the graph.
>
> If that's the case, keeping the shared infrastructure in the block layer
> makes sense.
>
> BDS member acct then holds I/O stats for the BDS.  We currently use it
> for something else: I/O stats of the device model backed by this BDS.
> That needs to move elsewhere.
> Two places come to mind:
>
> 1. BlockBackend, when it's available (I resumed working on it last week
>    for a bit).  Superficially attractive, because it's close to what we
>    have now, but then we have to deal with what to do when the backend
>    gets disconnected from its device model, then connected to another
>    one.
>
> 2. The device models that actually implement I/O accounting.  Since
>    query-blockstats names a backend rather than a device model, we need
>    a BlockDevOps callback to fetch the stats.  Fetch fails when the
>    callback is null.  Lets us distinguish "no stats yet" and "device
>    model can't do stats", thus permits a QMP interface that doesn't lie.
>
> Right now, I like (2) better.

OK.

> >> > 4) Making this happen
> >> > ---------------------
> >> >
> >> > Outscale wants to make these I/O stats happen and gave me the go-ahead to
> >> > do whatever grunt work is required to do so.
> >> > That said, we could collaborate on some parts of the work.
> >>
> >> Cool!
> >>
> >> A quick stab at tasks:
> >>
> >> * QMP interface, either a compatible extension of query-blockstats or a
> >>   new one.
> >
> > I would like to extend query-blockstats at first, but I also would like to
> > postpone the QMP interface changes and just write the shared infrastructure
> > and deploy it in the device models.
>
> Implementing improved data collection need not wait for the QMP design.
>
> >> * Rough idea on how to do the shared infrastructure.
> >
> > - API-wise I think about adding bdrv_acct_invalid() and bdrv_acct_failed(),
> >   and systematically issuing a bdrv_acct_start() as soon as possible.
>
> Complication: partial success.  Example:
>
> 1. Guest requests a read of N sectors.
>
> 2. Device model calls
>    bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
>
> 3. Device model examines the request, and deems it valid.
>
> 4. Device model passes it to the block layer.
>
> 5. Block layer does its thing, but for some reason only M < N sectors
>    can be read.  Block layer returns M.
>
> 6. What's the device model to do now?  Both bdrv_acct_failed() and
>    bdrv_acct_done() would be wrong.
>
>    Should the device model account for a read of size M?  This ignores
>    the partial failure.
>
>    Should it split the read into a successful and a failed part for
>    accounting purposes?  This miscounts the number of reads.

Maybe a bdrv_acct_partial() accounting the size of the data actually read in
the bandwidth counter and taking care of counting this event.
Maybe we will discover some other rare events to account for.

> > - To calculate the averages I think about a global timer firing every second
> >   and iterating over the BDS list to make the computations even when there is
> >   no I/O activity.  Is it acceptable to have a QemuMutex per statistics
> >   structure?
>
> Let me try to sketch something simpler.  For brevity, I cover only a
> single slice, but adding more is just more of the same.
>
> Have an end-of-slice timestamp.
>
> When accounting an operation, compare the current time to end-of-slice (we
> already get the current time to measure latency, so this is cheap).  If
> end-of-slice is in the past, zero the slice's stats, and update
> end-of-slice.
>
> When something asks for stats, do the same, to avoid returning stale
> slice stats.
>
> >> * Implement (can be split up into several tasks if desired)
> >
> > First I would like to write a series implementing a backward-compatible API
> > and get it merged.
> >
> > Then the deployment of the new API specifics in the device models can be
> > split up and parallelized.
>
> Works for me.
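
For reference, here is one possible way the counters listed earlier could be
grouped into a per-device-model stats structure.  None of these names exist
in QEMU today; the fields simply mirror the proposal, with all times kept in
nanoseconds as agreed above:

/* Hypothetical grouping of the proposed counters -- a sketch, not code
 * that exists in QEMU. */

#include <stdint.h>

enum { SLICE_1S, SLICE_1M, SLICE_1H, SLICE_MAX };

typedef struct BlockAcctSlice {
    int64_t  end_ns;            /* when this slice is tossed and restarted */
    uint64_t latency_ns[3];     /* accumulated latency: read, write, flush */
    uint64_t nr_ops[3];         /* completions accumulated in the slice */
    uint64_t depth_sum[2];      /* in-flight samples (read, write), so the */
    uint64_t depth_samples[2];  /* average queue depth is depth_sum / samples */
} BlockAcctSlice;

typedef struct BlockAcctStats {
    /* volume of data transferred by the I/Os */
    uint64_t read_bytes, write_bytes;
    /* operation count */
    uint64_t read_ios, write_ios, flush_ios;
    /* how many invalid I/Os the guest submitted */
    uint64_t invalid_read_ios, invalid_write_ios, invalid_flush_ios;
    /* how many I/O errors happened */
    uint64_t read_ios_error, write_ios_error, flush_ios_error;
    /* time spent doing I/Os, in nanoseconds */
    uint64_t total_read_time_ns, total_write_time_ns, total_flush_time_ns;
    /* timestamp of the last completion, to derive volume idleness */
    int64_t  last_io_ns;
    /* rolling 1s / 1m / 1h slices for average latency and queue depth */
    BlockAcctSlice slices[SLICE_MAX];
} BlockAcctStats;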
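
And a sketch of the timer-less slice handling Markus describes above, written
against the hypothetical structure from the previous snippet: no global timer
and no per-statistics mutex, the stale-slice check simply piggybacks on the
clock value already read to measure latency, and is repeated when stats are
queried so stale slices are never reported:

/* Sketch only -- assumes the hypothetical BlockAcctStats/BlockAcctSlice
 * layout above. */

#include <string.h>

static const int64_t slice_len_ns[SLICE_MAX] = {
    1000000000LL,               /* 1s */
    60LL   * 1000000000LL,      /* 1m */
    3600LL * 1000000000LL,      /* 1h */
};

/* Toss a slice that ended in the past and start a new one.  Called both
 * when accounting an operation and when stats are queried. */
static void acct_refresh_slice(BlockAcctSlice *s, int i, int64_t now_ns)
{
    if (now_ns >= s->end_ns) {
        memset(s, 0, sizeof(*s));
        s->end_ns = now_ns + slice_len_ns[i];
    }
}

/* Fold one completed operation (type: 0 read, 1 write, 2 flush) into every
 * slice.  The average latency reported for a slice is then simply
 * latency_ns[type] / nr_ops[type]. */
static void acct_add_to_slices(BlockAcctStats *stats, int type,
                               int64_t latency_ns, int64_t now_ns)
{
    int i;

    for (i = 0; i < SLICE_MAX; i++) {
        BlockAcctSlice *s = &stats->slices[i];

        acct_refresh_slice(s, i, now_ns);
        s->latency_ns[type] += latency_ns;
        s->nr_ops[type]++;
    }
}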
--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list