On 05.07.2018 14:47, David Hildenbrand wrote:
> On 05.07.2018 14:27, Janosch Frank wrote:
>> On 05.07.2018 09:26, David Hildenbrand wrote:
>>> On 27.06.2018 15:55, Janosch Frank wrote:
>>>> So, after the userfaultfd fix, postcopy does work now, but vSIE in combination with paging can still result in crashing g3s. Therefore we split up the series and only integrate non-vSIE support for now.
>>>>
>>>> Branch:
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>>
>>> So, after some off-list discussions, my view on things:
>>>
>>> 1. No dirty tracking on user space tables is necessary. All KVM-internal guest memory accesses have to manually set dirty bits (and we seem to do that very well already). All user space (QEMU) accesses have to do that on their own (also usually already done, but we might miss some corner cases).
>>>
>>> 2. Splitting of huge PMDs is not necessary without vSIE and is only a performance improvement for migration ("dirty 4k vs 1m").
>>
>> I feel uncomfortable without them. As soon as a notifier is set or memory is protected in a range smaller than 1m, we should split. The notifications should be on the smallest possible granularity, and 4k gives us the most flexibility and space for a lot of notifier bits in the PGSTE.
>>
>> I don't want to have IN on ptes and pmds.
>
> As I said, we only have to split if we really protect. Notifying can be handled on PMDs. It shouldn't really be that complicated to have two bits and do the invalidation.

Yes, we don't have to, but I think it's somewhat cleaner. It's not complicated, but for vSIE, where we split, we would have IN on ptes and on pmds, and I don't want two possible places for the IN notification. Or notifying IN on the pmd and then going to the ptes for notifying vSIE tables...

>>> 3. PGSTE for split PMDs could be avoided, just like for shadow GMAPs; we only have to care about locking for the two notifier bits.
>>
>> Theoretically we can, until vSIE comes along and we need PGSTE_VSIE and PGSTE_IN. So we actually need them.
>
> You can reuse whatever software bits we have for PTEs for that. These are GMAP-only tables.

I could, but splitting is horrible enough; I want to keep split ptes as close to the normal ones as possible.

>>> So in summary, I would really like to see a v1 without splitting of PMDs and a v2 with lazy splitting of all PMDs. This might require some work, but it should be fairly easy to get v1 implemented (in contrast to the splitting magic). v3 can then take care of vSIE. If I am not completely wrong, even THP could be handled fairly easily in a v1 already.
>>
>> I just factored out splitting and re-added IN for segments in an extra patch, so we actually see what the changes are before I squash it. We should be absolutely sure that we want it this way, because I'm absolutely not gonna rewrite it again.
>
> Better make a copy ;)

I did.

>> Even this "simple" change completely broke huge page support, possibly because I forgot one single line somewhere.
>
> This means that
>
> 1. We are forgetting something important, which would be bad.
> 2. Your current code is too complicated. :P
>
>>> Especially v2 can handle it similarly to x86:
>>>
>>> a) dirty tracking: when enabling, mark all PMDs in the GMAP as R/O. All new *read* GMAP faults can map PMDs. All new GMAP *write* faults will only create split PMDs, and do this lazily (replacing protected PMDs by split PMDs). So after a write fault, only one new 4k page will be mapped writable and marked as dirty. We can detect whether dirty tracking is turned on/off using the kvm slots. We might be able to simply make global decisions to simplify the implementation: "one slot is set to do dirty tracking, do it for all".
>>
>> Not a big fan of that; it means that for migration I suddenly need more kernel memory.
>
> We do that already for the dirty bitmap. And if x86 does it similarly, this should not be that problematic.

Alright, since you and Christian want it, let's do that.

>>> b) mprotect/protect_range: As long as we have no restricted protection (e.g. the prefix only sets a notifier bit), there is no need to split. You can use a notifier bit on the PMD. Once we actually want to restrict access (sub-PMD protection, e.g. for gmap shadow - we should forbid this initially on huge PMDs), we have to split. This means that without vSIE we never have to split, except for a).
>>
>> See 2.
>>
>>> The nice thing about this approach (if I didn't miss anything) is that the PMDs with a guest prefix on them are only split if dirty tracking is active (or vSIE requires it via mprotect()) - in contrast to this series. And in contrast to this series, *all* memory will be dirty-tracked using 4k instead of 1MB pages, which makes a lot of sense - even one CPU can otherwise easily set many 1MB pages dirty during migration.
>>
>> Is this a real concern that actually breaks migration?
>
> Well, the reason for splits without vSIE from your (and Christian's) side was, as far as I understood:
>
> "the prefix is always dirty, therefore we mark too many 1MB pages dirty".
>
> My point is that even one CPU can simply mark so many 1MB pages dirty during one migration iteration that you run into the exact same problems. There seems to be a good reason why x86 does it that way.

Let's trust in X, what could go wrong? :)
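
For illustration, here is a minimal userspace C model of the lazy splitting described in point a) above. It is only a sketch of the idea, not kernel code; the types and names (struct segment, write_fault, PAGES_PER_SEG) are invented for the example. With dirty tracking off, a write dirties the whole 1M segment; with tracking on, the first write lazily splits the segment so that only the faulting 4k page is marked dirty.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGES_PER_SEG 256	/* 1M segment / 4k pages */

struct segment {
	bool split;			/* has the 1M entry been split into 4k PTEs? */
	bool seg_dirty;			/* dirty bit of the 1M entry */
	bool pte_dirty[PAGES_PER_SEG];	/* per-4k dirty bits once split */
};

/* Write fault: with tracking enabled, split lazily and dirty only the
 * faulting 4k page instead of the whole 1M segment. */
static void write_fault(struct segment *seg, unsigned int page, bool tracking)
{
	if (!tracking) {
		seg->seg_dirty = true;		/* whole 1M marked dirty */
		return;
	}
	if (!seg->split) {			/* lazy split on first write */
		seg->split = true;
		memset(seg->pte_dirty, 0, sizeof(seg->pte_dirty));
	}
	seg->pte_dirty[page] = true;		/* only this 4k page dirty */
}

int main(void)
{
	struct segment seg = { 0 };

	write_fault(&seg, 7, true);
	printf("split=%d, page 7 dirty=%d, page 8 dirty=%d\n",
	       seg.split, seg.pte_dirty[7], seg.pte_dirty[8]);
	return 0;
}

After the single write, only one of the 256 per-4k dirty bits is set, which is the "dirty 4k vs 1m" gain the thread is arguing about.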
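
The rule from point b) can be sketched in the same toy style: setting a notifier bit alone can stay on the huge PMD, while actually restricting access to a sub-PMD range forces a split. Again this is not kernel code, and all names (struct seg, protect_range, in_bit) are invented for the example.

#include <stdbool.h>
#include <stdio.h>

struct seg {
	bool split;	/* has the 1M entry been split into 4k PTEs? */
	bool in_bit;	/* invalidation notifier (IN) bit on the PMD */
};

static void protect_range(struct seg *s, bool restrict_access)
{
	if (!restrict_access) {
		s->in_bit = true;	/* notification only: IN bit on the PMD suffices */
		return;
	}
	/* restricted sub-PMD protection (e.g. gmap shadow): must split */
	s->split = true;
	s->in_bit = true;
}

int main(void)
{
	struct seg prefix = { 0 }, shadow = { 0 };

	protect_range(&prefix, false);	/* guest prefix: notifier bit only */
	protect_range(&shadow, true);	/* vSIE shadow table: split */
	printf("prefix: split=%d in=%d, shadow: split=%d in=%d\n",
	       prefix.split, prefix.in_bit, shadow.split, shadow.in_bit);
	return 0;
}

This matches the summary in the thread: without vSIE, the only reason left to split is the lazy dirty-tracking path from a).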