On Monday 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> Cc'ing libvirt following Stefan's lead.
>
> Benoît Canet <benoit.canet@xxxxxxxxxxx> writes:
>
> > Hi,
> >
> > I collected some items of a cloud provider wishlist regarding I/O accounting.
>
> Feedback from real power-users, lovely!
>
> > In a cloud, I/O accounting can have 3 purposes: billing, helping the
> > customers and doing metrology to help the cloud provider seek hidden costs.
> >
> > I'll cover the two former topics in this mail because they are the most
> > important business-wise.
> >
> > 1) preferred place to collect billing I/O accounting data:
> > -----------------------------------------------------------
> > For billing purposes the collected data must be as close as possible to
> > what the customer would see by running iostat in his VM.
>
> Good point.
>
> > The first conclusion we can draw is that the choice of collecting I/O
> > accounting data used for billing in the block device models is right.
>
> Slightly rephrasing: doing I/O accounting in the block device models is
> right for billing.
>
> There may be other uses for I/O accounting, with different preferences.
> For instance, data on how exactly guest I/O gets translated to host I/O
> as it flows through the nodes in the block graph could be useful.

I think this is the third point, the one I named metrology.
Basically it boils down to "where are the hidden I/O costs of the QEMU
block layer".

> Doesn't diminish the need for accurate billing information, of course.
>
> > 2) what to do with occurrences of rare events:
> > ----------------------------------------------
> >
> > Another point is that QEMU developers agree that they don't know which
> > policy to apply to some I/O accounting events.
> > Must QEMU discard invalid write I/Os or account them as done?
> > Must QEMU count a failed read I/O as done?
> >
> > When discussing this with a cloud provider the following became apparent:
> > these decisions are really specific to each cloud provider and QEMU should
> > not implement them.
>
> Good point, consistent with the old advice to avoid baking policy into
> inappropriately low levels of the stack.
>
> > The right thing to do is to add accounting counters to collect these events.
> >
> > Moreover these rare events are precious troubleshooting data, so that's an
> > additional reason not to toss them.
>
> Another good point.
>
> > 3) list of block I/O accounting metrics wished for billing and helping
> >    the customers
> > -----------------------------------------------------------------------
> >
> > Basic I/O accounting data will end up making the customers' bills.
> > Extra I/O accounting information would be a precious help for the cloud
> > provider to implement a monitoring panel like Amazon CloudWatch.
>
> These are the first two from your list of three purposes, i.e. the ones
> you promised to cover here.
>
> > Here is the list of counters and statistics I would like to help
> > implement in QEMU.
> >
> > This is the most important part of the mail and the one I would like
> > the community to review the most.
> >
> > Once this list is settled I would proceed to implement the required
> > infrastructure in QEMU before using it in the device models.
>
> For context, let me recap how I/O accounting works now.
>
> The BlockDriverState abstract data type (short: BDS) can hold the
> following accounting data:
>
>     uint64_t nr_bytes[BDRV_MAX_IOTYPE];
>     uint64_t nr_ops[BDRV_MAX_IOTYPE];
>     uint64_t total_time_ns[BDRV_MAX_IOTYPE];
>     uint64_t wr_highest_sector;
>
> where BDRV_MAX_IOTYPE enumerates read, write, flush.
>
> wr_highest_sector is a high watermark updated by the block layer as it
> writes sectors.
>
> The other three are *not* touched by the block layer.  Instead, the
> block layer provides a pair of functions for device models to update
> them:
>
>     void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
>                          int64_t bytes, enum BlockAcctType type);
>     void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
>
> bdrv_acct_start() initializes the cookie for a read, write, or flush
> operation of a certain size.  The size of a flush is always zero.
>
> bdrv_acct_done() adds the operation to the BDS's accounting data.
> total_time_ns is incremented by the time between _start() and _done().
>
> You may call _start() without calling _done().  That's a feature.
> Device models use it to avoid accounting some requests.
>
> Device models are not supposed to mess with the cookie directly, only
> through these two functions.
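
To make sure I understand the device-model side correctly, the pattern is
roughly the following.  This is a simplified sketch, not lifted from any real
device model, and example_device_read is an invented name:

    static int example_device_read(BlockDriverState *bs, int64_t sector_num,
                                   uint8_t *buf, int nb_sectors)
    {
        BlockAcctCookie cookie;
        int ret;

        /* declare the request: size in bytes and type */
        bdrv_acct_start(bs, &cookie, nb_sectors * BDRV_SECTOR_SIZE,
                        BDRV_ACCT_READ);
        ret = bdrv_read(bs, sector_num, buf, nb_sectors);
        if (ret == 0) {
            /* adds the bytes, the op count and the elapsed time to the BDS */
            bdrv_acct_done(bs, &cookie);
        }
        /* on failure some device models still call _done(), others don't:
         * exactly the inconsistency you describe next */
        return ret;
    }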
> Some device models implement accounting, some don't.  The ones that do
> don't agree on how to count invalid guest requests (the ones not passed
> to the block layer) and failed requests (passed to the block layer and
> failed there).  It's a mess in part caused by us never writing down what
> exactly device models are expected to do.
>
> Accounting data is used by "query-blockstats", and nothing else.
>
> Corollary: even though every BDS holds accounting data, only the ones in
> "top" BDSes ever get used.  This is a common block layer blemish, and
> we're working on cleaning it up.
>
> If a device model doesn't implement accounting, query-blockstats lies.
> Fortunately, its lies are pretty transparent (everything's zero) as long
> as you don't do things like connecting a backend to a device model that
> doesn't implement accounting after disconnecting it from a device model
> that does.  Still, I'd welcome a more honest QMP interface.
>
> For me, this accounting data belongs to the device model, not the BDS.
> Naturally, the block device models should use common infrastructure.  I
> guess they use the block layer only because it's obvious infrastructure
> they share.  Clumsy design.
>
> > /* volume of data transferred by the I/Os */
> > read_bytes
> > write_bytes
>
> This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].
>
> nr_bytes[BDRV_ACCT_FLUSH] is always zero.
>
> Should this count only actual I/O, i.e. the accumulated size of successful
> operations?
>
> > /* operation count */
> > read_ios
> > write_ios
> > flush_ios
> >
> > /* how many invalid I/Os the guest submitted */
> > invalid_read_ios
> > invalid_write_ios
> > invalid_flush_ios
> >
> > /* how many I/O errors happened */
> > read_ios_error
> > write_ios_error
> > flush_ios_error
>
> This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
> nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
>
> > /* account the time spent doing I/Os */
> > total_read_time
> > total_write_time
> > total_flush_time
>
> This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
> total_time_ns[BDRV_ACCT_FLUSH].
>
> I guess this should count both successful and failed I/O.  Could throw
> in invalid, too, but it's probably too quick to matter.
>
> Please specify the unit clearly.  Either total_FOO_time_ns or total_FOO_ns
> would work for me.

Yes, _ns is fine for me too.

> > /* since when the volume is iddle */
> > qvolume_iddleness_time
>
> "idle"
>
> The obvious way to maintain this information with the current code
> would be saving the value of get_clock() in bdrv_acct_done().
>
> > /* the following would compute latencies for slices of 1 second, then toss
> >  * the result and start a new slice.  A weighted summation of the instant
> >  * latencies could help to implement this.
> >  */
> > 1s_read_average_latency
> > 1s_write_average_latency
> > 1s_flush_average_latency
> >
> > /* the former three numbers could be used to further compute a 1 minute
> >  * slice value */
> > 1m_read_average_latency
> > 1m_write_average_latency
> > 1m_flush_average_latency
> >
> > /* the former three numbers could be used to further compute a 1 hour
> >  * slice value */
> > 1h_read_average_latency
> > 1h_write_average_latency
> > 1h_flush_average_latency
>
> This is something like "what we added to total_FOO_time in the last
> completed 1s / 1m / 1h time slice divided by the number of additions".
> Just another way to accumulate the same raw data, thus no worries.
>
> > /* 1 second average number of requests in flight */
> > 1s_read_queue_depth
> > 1s_write_queue_depth
> >
> > /* 1 minute average number of requests in flight */
> > 1m_read_queue_depth
> > 1m_write_queue_depth
> >
> > /* 1 hour average number of requests in flight */
> > 1h_read_queue_depth
> > 1h_write_queue_depth
>
> I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
> The "you need not call bdrv_acct_done()" feature may get in the way.
> Solvable.
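
For these 1s / 1m / 1h figures I have something like the following in mind.
A very rough sketch only, every name in it is invented and nothing is settled:

    /* One slice per BDS and per I/O type (read, write, flush).  A global
     * timer would call acct_slice_tick() once per second, even when the
     * volume is idle, so a quiet slice really averages to zero. */
    typedef struct BlockAcctSlice {
        uint64_t ops;             /* operations completed in the open slice */
        uint64_t time_ns;         /* their accumulated completion time */
        uint64_t avg_latency_ns;  /* published result of the last closed slice */
    } BlockAcctSlice;

    static void acct_slice_tick(BlockAcctSlice *s)
    {
        s->avg_latency_ns = s->ops ? s->time_ns / s->ops : 0;
        /* the 1m and 1h values would be folded from the 1s results, and the
         * queue depth averages would be maintained the same way */
        s->ops = 0;
        s->time_ns = 0;
    }

bdrv_acct_done() (or its failed/invalid variants, see below) would feed ops
and time_ns while the slice is open.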
> Permit me a short detour into the other use for I/O accounting I
> mentioned: data on how exactly guest I/O gets translated to host I/O as
> it flows through the nodes in the block graph.  Do you think this would
> be pretty much the same data, just collected at different points?

That is something I would like to take care of in a further sub-project:
optionally collecting the same data for each BDS of the graph.

> > 4) Making this happen
> > ---------------------
> >
> > Outscale wants to make these I/O stats happen and gave me the go-ahead to
> > do whatever grunt work is required to do so.
> > That said, we could collaborate on some parts of the work.
>
> Cool!
>
> A quick stab at tasks:
>
> * QMP interface, either a compatible extension of query-blockstats or a
>   new one.

I would like to extend query-blockstats at first, but I would also like to
postpone the QMP interface changes and just write the shared infrastructure
and deploy it in the device models.

> * Rough idea on how to do the shared infrastructure.

- API-wise I am thinking about adding bdrv_acct_invalid() and
  bdrv_acct_failed(), and about systematically issuing bdrv_acct_start() as
  soon as possible (rough prototypes in the P.S. below).

- To calculate the averages I am thinking about a global timer firing every
  second and iterating over the BDS list to do the computations even when
  there is no I/O activity.

Is it acceptable to have one QemuMutex per statistics structure?

> * Implement (can be split up into several tasks if desired)

First I would like to write a series implementing a backward-compatible API
and get it merged.
Then the deployment of the new API in the device models can be split up and
parallelized.

Best regards

Benoît
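
P.S.: for the API additions mentioned above, here is the rough shape I have
in mind.  Just a sketch: these functions do not exist yet and the exact
signatures are not settled at all.

    /* request rejected before being submitted to the block layer
     * (it may end up taking the cookie instead, to be discussed) */
    void bdrv_acct_invalid(BlockDriverState *bs, enum BlockAcctType type);

    /* request submitted to the block layer and failed there; the cookie
     * comes from the bdrv_acct_start() issued when the request arrived */
    void bdrv_acct_failed(BlockDriverState *bs, BlockAcctCookie *cookie);

A device model would then issue bdrv_acct_start() as soon as a request
arrives and finish it with exactly one of bdrv_acct_done(),
bdrv_acct_failed() or bdrv_acct_invalid().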