On 06/08/2016 02:04 PM, Tomasz Flendrich wrote:
Hello everyone,

Let me introduce myself - I'm Tomasz Flendrich, a Google Summer of Code student from the University of Wrocław, Poland. My goal is to create a generic address allocator (for PCI, virtio, SCSI, etc.), because the current handling of addresses has flaws: sometimes addresses aren't properly assigned or checked for duplicates.
I know you took this sentence straight from the wiki page of GSoC ideas, but I'd be interested to see a few specific examples of the problems that you intend to solve (maybe Martin can help here, since I think he's the one who wrote that).
I know of a few problems with PCI address allocation, but abstracting the data structure and functions used for allocation isn't going to solve them. For example, if you attach a device that uses PCI with "virsh attach-device .... --live", then add another with "virsh attach-device --live --config", the new device will be assigned a different address in the persistent config than in the live domain (the result will be that the next time you shut down and restart, the device will appear at a different address, which could confuse the guest OS). Another (which I need to fix *very* soon) is that even when a device is classified as connecting to PCIe slots rather than PCI (and there aren't enough of those right now, which is *yet another* problem I need to fix), if there aren't any empty PCIe slots available to assign, we'll fail rather than adding a new controller of the appropriate type, attached to an appropriate existing controller. But, as I said, a new data structure isn't going to solve any of those problems (and I'm not sure it will make the solution any easier to get to either).
Additionally, a generic solution will be very useful when more hypervisors add the feature of explicitly stating the addresses of devices. The key goal I want to achieve at this moment is defining a minimum viable product (something that can be implemented quickly, is reasonably complete on its own, and has potential for further development in a direction that makes sense).

I came up with an initial design. The allocator's internal data will be kept in a struct Allocator. There will be one structure per address type, and all of the allocators will be kept in an array. We will have the possibility of adding an address pool to the allocator. For example, when we create a root PCI bus, we tell the allocator that the following PCI addresses are possible: {0..0}:{0..31}.{0..7}, where {a..b} means a range from a to b, inclusive.
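To make this a bit more concrete, here is a rough, purely illustrative C sketch of the structures I have in mind (every name is provisional, nothing here is final):

    /* Provisional sketch only. */
    typedef struct {
        unsigned int lo;    /* first value in the range, inclusive */
        unsigned int hi;    /* last value in the range, inclusive */
    } AllocatorRange;

    /* constructor-style helper used in the example calls below;
     * it could just as well be a small inline function */
    #define AllocatorRange(lo, hi) ((AllocatorRange){ (lo), (hi) })

    typedef struct {
        size_t nDimensions;           /* e.g. 3 for PCI: bus, slot, function */
        AllocatorRange **freeRanges;  /* each entry holds nDimensions ranges */
        size_t nFreeRanges;
    } Allocator;

    /* one allocator per address type */
    enum { PCI_ADDRESS_TYPE, SCSI_ADDRESS_TYPE, /* ... */ ADDRESS_TYPE_LAST };
    Allocator *allocator[ADDRESS_TYPE_LAST];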
PCI addresses are likely much more complicated than you've planned for. For starters, each PCI bus can have a different number of slots available: the 2nd dimension, which you list as {0..31}, has different start & end indexes depending on the type of controller that provides the bus (e.g. pci-root, pcie-root, and pci-bridge have 31 slots from 1-31; dmi-to-pci-bridge and pcie-switch-upstream-port have 32 slots from 0-31; pcie-root-port and pcie-switch-downstream-port each have a single slot at 0). And each bus has different rules about what devices can be plugged into it (e.g. a pcie-root-port can only be plugged into pcie-root or a pcie-switch-downstream-port, some devices can only be plugged into pci-root, pci-bridge, or dmi-to-pci-bridge, some can be plugged into pcie-root directly if specifically requested but it's otherwise avoided, etc.). So it's not just a case of a 4-dimensional vector with the same size at each index on the same axis, and each address is not the same as any other address.
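Just to make that concrete: any per-bus bookkeeping needs at least something like the following (a completely made-up struct, only to illustrate; a real version would also have to encode the connection rules, not just the slot limits):

    /* Illustrative only: usable slot ranges differ per controller model. */
    typedef struct {
        const char *model;
        unsigned int minSlot;
        unsigned int maxSlot;
    } BusInfo;

    static const BusInfo busInfo[] = {
        { "pci-root",                    1, 31 },
        { "pcie-root",                   1, 31 },
        { "pci-bridge",                  1, 31 },
        { "dmi-to-pci-bridge",           0, 31 },
        { "pcie-switch-upstream-port",   0, 31 },
        { "pcie-root-port",              0,  0 },
        { "pcie-switch-downstream-port", 0,  0 },
    };

so a single flat {0..31} slot range per bus doesn't capture reality.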
Are you intending for this allocator to have any sort of "give me the next free address" function? If so, doing this for PCI is likely to require you to add lots of extra complexity that will be completely unnecessary for the other address types. (Although... there are several different kinds of SCSI controller (including virtio-scsi as well as several emulations of real hardware) that all share the same address space, and we need to know what model each controller is so that the user can be sure devices are placed on the desired controller.) Of course, maybe your intent is that your allocator will only implement an "allocate this specific address" function and leave auto-assignment up to a higher layer; but then the higher layer would still need a way to efficiently scan the entire list of free addresses, and each address would need a bit of data describing the controller at that address so that they could be matched.
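To put it another way, these are two rather different operations (hypothetical signatures, just to show how much more information the second one needs):

    /* "reserve this specific address": the caller supplies the full address */
    int allocatorReserveAddress(Allocator *a, unsigned int *outAddr, ...);

    /* "give me the next free address": the allocator (or the caller scanning
     * its free list) must know what the device requires and what each
     * bus/range provides in order to pick a suitable spot */
    int allocatorReserveNextFree(Allocator *a, unsigned int *outAddr,
                                 unsigned int deviceConnectFlags);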
(Another thing - at any time a new bus of just about any type can be added, sometimes in response to a request to allocate an address for which no matching slot is available (e.g. adding a new pcie-root-port, pcie-switch-upstream-port, and a set of pcie-switch-downstream-ports when there are no more hotpluggable PCIe slots available)).
This function call could look like this:

    allocatorAddPool(allocator[PCI_ADDRESS_TYPE],
                     &outputAddress,
                     AllocatorRange(0, 0),
                     AllocatorRange(0, 31),
                     AllocatorRange(0, 7));
Again, this doesn't allow for some of the indexes on a particular dimension having a different start/end (or for other differences for a particular index, which possibly could be stored in some opaque pointer).
The outputAddress would be an array owned by the caller (only filled in by the allocator).

Suppose we were to allocate IP addresses for computers from the range 192.168.1.2 to 192.168.1.240, where the router has to have the address 192.168.1.254. First, we reserve the router's address to let the Allocator know that it's in use, so that we can track collisions in manually assigned addresses:

    allocatorReserveAddress(allocator[IP_ADDRESS_TYPE],
                            &outputAddress,
                            AllocatorValue(192),
                            AllocatorValue(168),
                            AllocatorValue(1),
                            AllocatorValue(254));

Then, we assign the addresses for the computers:

    allocatorReserveAddress(allocator[IP_ADDRESS_TYPE],
                            &outputAddress,
                            AllocatorValue(192),
                            AllocatorValue(168),
                            AllocatorValue(1),
                            AllocatorRange(2, 240));

Please note that AllocatorValue() is now in use alongside AllocatorRange(). There could also be a different function call to simply assign any free address:

    allocatorReserveAny(Allocator *allocator, &outputAddress);
Ah, there it is! This won't work for PCI (or for scsi) without much more information available for the ranges that are being allocated from.
Let's say that we want an "sda" disk. We could create a wrapper:

    allocatorReserveSCSIAddress(allocator, "sda");

All this wrapper does is represent the string "sda" as an integer mapping to the letter (e.g. 'a' = 0, 'z' = 25):

    allocatorReserveAddress(allocator[SCSI_ADDRESS_TYPE], &outputAddress, 0);
Disk device names are incredibly ugly beasts. Apparently once you get to sdz, you start over at sdaa, sdab, etc. And the whole thing is pointless anyway, because that name isn't passed directly to the guest (there isn't a way to do that - the guest OS determines device names on its own based on the bus and the order in which the disks appear).
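(For what it's worth, the mapping itself is a bijective base-26 numbering rather than a simple letter-to-integer one; libvirt already has a helper along these lines, virDiskNameToIndex(), if I remember correctly. A minimal sketch of the idea:

    /* Illustrative only: "a"->0 ... "z"->25, "aa"->26, "ab"->27, ... */
    static int
    diskSuffixToIndex(const char *suffix)
    {
        int idx = 0;
        for (; *suffix; suffix++) {
            if (*suffix < 'a' || *suffix > 'z')
                return -1;
            idx = idx * 26 + (*suffix - 'a' + 1);
        }
        return idx - 1;
    }

But again, the guest doesn't see this name, so I wouldn't spend effort making the allocator understand it.)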
If an address is already determined, because it was specified in the XML or because some specific device has to be at some specific address, we still reserve it to let the Allocator know that it's in use.

How would this work internally? One of the possible solutions is keeping a set of ranges of free addresses. For example, suppose that we have two PCI buses to use. Our address pool is stored as one range:

    {0..1} {0..31} {0..7}

Now someone reserves address 0:13.2. Our pool of free addresses is now stored as this set of ranges:

    {0..0} {0..12} {0..7},
    {0..0} {13..13} {0..1},
    {0..0} {13..13} {3..7},
    {0..0} {14..31} {0..7},
    {1..1} {0..31} {0..7}

If we kept every address separately, it would require 2*32*8=512 addresses.
...which isn't all that much, since we aren't talking about microcontrollers with tiny amounts of RAM.
The set data structure from gnulib's gl_oset.h is a candidate for keeping the ranges in a sorted fashion. Another candidate is simply a sorted list from gl_list.h.
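Whichever container ends up being used, the core operation when reserving a single value is splitting the free range it falls into. Roughly, for a single dimension (a provisional sketch, no error handling):

    #include <stddef.h>

    typedef struct { unsigned int lo, hi; } Range;

    /* Assuming lo <= v <= hi, reserving v leaves at most two free ranges:
     * {lo..v-1} and {v+1..hi}. Returns how many were produced. */
    static size_t
    splitRange(Range r, unsigned int v, Range out[2])
    {
        size_t n = 0;
        if (v > r.lo)
            out[n++] = (Range){ r.lo, v - 1 };
        if (v < r.hi)
            out[n++] = (Range){ v + 1, r.hi };
        return n;
    }

For a multi-dimensional address, the dimensions before the one being split stay pinned to the reserved values, which is how the five ranges in the PCI example above come about.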
Without looking at the implementation, I'm concerned that this may be adding lots of extra code complexity in order to save just a few bytes of RAM, which realistically is pointless - we only have one pool of PCI addresses per active domain, so the amount of memory used to store a bitmap of the availability of addresses is inconsequential in relation to the total memory requirement for a domain.
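For comparison, a plain bitmap over your whole example pool is 64 bytes (a back-of-the-envelope sketch, not actual libvirt code):

    #include <limits.h>
    #include <stdbool.h>

    /* 2 buses * 32 slots * 8 functions = 512 addresses = 64 bytes of bitmap */
    #define NBUSES 2
    #define NSLOTS 32
    #define NFUNCS 8

    static unsigned char used[(NBUSES * NSLOTS * NFUNCS + CHAR_BIT - 1) / CHAR_BIT];

    static bool
    reserve(unsigned int bus, unsigned int slot, unsigned int func)
    {
        unsigned int bit = (bus * NSLOTS + slot) * NFUNCS + func;
        if (used[bit / CHAR_BIT] & (1u << (bit % CHAR_BIT)))
            return false;                   /* already in use */
        used[bit / CHAR_BIT] |= 1u << (bit % CHAR_BIT);
        return true;
    }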
This structure would be able to handle all types of addresses that are convertible to a fixed-length list of integers. We don't mind how many of these integers there are, because we can use variadic arguments. It won't allow duplicates if we stick to using it in every place where we have addresses. It will also be very generic, with the possibility of writing wrappers on top of it that might be more convenient for some particular address type. This way we would keep the qemu_* files as they are for now, and just replace the existing allocators (and manual assignment of addresses) with calls to our new allocator.

My ideas:
- A possibility of reserving a range of addresses might be used to reserve a whole PCI slot at once.
I do see the usefulness of this in general, but in this particular case, rather than reserving an entire slot at once, we may want to allocate the single function where the device is going to be, but leave the rest of that slot available for further functions to be allocated, and then later mark it unavailable for further allocations *without actually allocating the rest of the functions in the slot*. The reason for this is that it would make it simpler to use the same code for detaching a "lone" device on a slot as we do for detaching the functions of a multifunction PCI device. (You're likely thinking "what's the difference between allocating a single function and marking the other functions on the slot as unavailable vs. just marking the entire slot as allocated?" It all has to do with being able to easily notice when the slot is once again completely available - if you mark all the functions in the slot as used, then when you free the device you have to free all the functions; and if you're doing that, then you have to figure out whether the device you're freeing was one function of a multifunction device, or a lone device plugged into a slot by itself. Of course that may not fit in with your abstracted view of generic address allocation.) Right now we "ham-fist" this issue by just marking the entire slot as allocated when someone asks for a single function, but really we should do allocations (and frees) one function at a time, just with knowledge of when it's appropriate to allocate a 2nd (and further) function on a slot that already has some functions allocated, and when it isn't.
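One (purely illustrative) way of expressing that distinction in per-slot state would be something like:

    #include <stdbool.h>

    /* Illustrative only: a slot is completely free again only when
     * 'functions' drops back to 0; 'acceptingNew' is what gets cleared
     * when we decide no further devices should land on this slot. */
    typedef struct {
        unsigned char functions;   /* bitmask of functions 0-7 in use */
        bool acceptingNew;         /* may further functions be allocated here? */
    } SlotState;

That way freeing is always "clear one function's bit", regardless of whether the device was alone on the slot or one function of a multifunction device.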
- Flags could be added, changing the behavior of particular addresses or of the whole allocator.

My questions:
- Is this approach good long-term?
Again, my concern would be about making something more complex than necessary for the individual uses and their requirements. Eliminating nearly duplicate code is good, but you need to be careful that what you create doesn't require more "glue" to make each of the cases fit into the abstract model than would have been used for a separate simpler "from-scratch" implementation for each individual case.
- Is address releasing expected to be required at any point? (think device hotplug/unplug, etc.)
Yes. And we need to be aware that sometimes devices are added to the live domain but not the persistent config, sometimes only to the persistent config, and sometimes to both (and when they're added to both, they should use the same address in both live domain and persistent config, which isn't always the case today).
- What other features are needed?
You had said you wanted the qemu_* files to remain the same for now. Exactly what are you figuring on replacing? For example, for PCI addresses (since these are obviously the only ones I care about :-)) are you just planning to replace virDomainPCIAddressBus with your new virAllocator and a set of accessor functions? Even if you do nothing higher level than this, each bus will need to have a "connectFlags" that describes what type of devices can be connected to that bus (or range, or whatever you call it).
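By "connectFlags" I mean something along these lines (hypothetical names; libvirt's existing PCI address code keeps similar per-bus information, if I remember correctly):

    #include <stdbool.h>

    /* Hypothetical: what a bus offers / what a device needs. */
    typedef enum {
        CONNECT_HOTPLUGGABLE = 1 << 0,   /* bus supports hotplug */
        CONNECT_PCI_DEVICE   = 1 << 1,   /* conventional PCI devices */
        CONNECT_PCIE_DEVICE  = 1 << 2,   /* PCI Express devices */
    } ConnectFlags;

    static bool
    busAccepts(unsigned int busFlags, unsigned int devFlags)
    {
        /* the device's required connection type must be one the bus offers */
        return (busFlags & devFlags & (CONNECT_PCI_DEVICE | CONNECT_PCIE_DEVICE)) != 0;
    }

Whatever the abstraction ends up looking like, the allocator (or its caller) has to be able to consult this information when picking an address.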
Please speak your mind if you have any remarks or thoughts about the idea. I'd really appreciate it.
Hopefully my comments haven't just made the situation more muddy :-)