Palmer Dabbelt <palmer@xxxxxxxxxxx> writes:

> On Thu, 01 Feb 2024 09:39:22 PST (-0800), alex.bennee@xxxxxxxxxx wrote:
>> Palmer Dabbelt <palmer@xxxxxxxxxxx> writes:
>>
>>> On Tue, 30 Jan 2024 12:28:27 PST (-0800), stefanha@xxxxxxxxx wrote:
>>>> On Tue, 30 Jan 2024 at 14:40, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>>>>>
>>>>> On Mon, 15 Jan 2024 08:32:59 PST (-0800), stefanha@xxxxxxxxx wrote:
>>>>> > Dear QEMU and KVM communities,
>>>>> > QEMU will apply for the Google Summer of Code and Outreachy internship
>>>>> > programs again this year. Regular contributors can submit project
>>>>> > ideas that they'd like to mentor by replying to this email before
>>>>> > January 30th.
>>>>>
>>>>> It's the 30th, sorry if this is late but I just saw it today. +Alistair
>>>>> and Daniel, as I didn't sync up with anyone about this so not sure if
>>>>> someone else is looking already (we're not internally).
>> <snip>
>>>> Hi Palmer,
>>>> Performance optimization can be challenging for newcomers. I wouldn't
>>>> recommend it for a GSoC project unless you have time to seed the
>>>> project idea with specific optimizations to implement based on your
>>>> experience and profiling. That way the intern has a solid starting
>>>> point where they can have a few successes before venturing out to do
>>>> their own performance analysis.
>>>
>>> Ya, I agree. That's part of the reason why I wasn't sure if it's a
>>> good idea. At least for this one I think there should be some easy to
>>> understand performance issue, as the loops that go very slowly consist
>>> of a small number of instructions and go a lot slower.
>>>
>>> I'm actually more worried about this running into a rabbit hole of
>>> adding new TCG operations or even just having no well defined mappings
>>> between RVV and AVX, those might make the project really hard.
>>
>> You shouldn't have a hard guest-target mapping. But are you already
>> using the TCGVec types and they are not expanding to AVX when it's
>> available?
>
> Ya, sorry, I guess that was an odd way to describe it. IIUC we're
> doing sane stuff, it's just that RISC-V has a very different vector
> masking model than other ISAs. I just said AVX there because I only
> care about the performance on Intel servers, since that's what I run
> QEMU on. I'd assume we have similar performance problems on other
> targets, I just haven't looked.
>
> So my worry would be that the RVV things we're doing slowly just don't
> have fast implementations via AVX and thus we run into some
> intractable problems. That sort of stuff can be really frustrating
> for an intern, as everything's new to them so it can be hard to know
> when something's an optimization dead end.
>
> That said, we're seeing 100x slowdowns in microbenchmarks and 10x
> slowdowns in real code, so I think there should be some way to do
> better.

It would be nice if you could convert that micro-benchmark to plain C
for a tcg/multiarch test case. It would be a useful tool for testing
changes.

>
>> Remember for anything float we will end up with softfloat anyway so we
>> can't use SIMD on the backend.
>
> Yep, but we have a handful of integer slowdowns too so I think there's
> some meat to chew on here. The softfloat stuff should be equally slow
> for scalar/vector, so we shouldn't be tripping false positives there.
>
>>>> Do you have the time to profile and add specifics to the project idea
>>>> by Feb 21st? If that sounds good to you, I'll add it to the project
>>>> ideas list and you can add more detailed tasks in the coming weeks.
>>>
>>> I can at least dig up some of the examples I ran into, there's been a
>>> handful filtering in over the last year or so.
>>>
>>> This one
>>> <https://gist.github.com/compnerd/daa7e68f7b4910cb6b27f856e6c2beba>
>>> still has a much more than 10x slowdown (73ms -> 13s) with
>>> vectorization, for example.
>>>
>>>> Thanks,
>>>> Stefan
>>
>> --
>> Alex Bennée
>> Virtualisation Tech Lead @ Linaro

--
Alex Bennée
Virtualisation Tech Lead @ Linaro