Hmmm, I doubt that. It seems very important that your data be in
registers when you want to do arithmetic on it. I can see that if
your data was already in registers, maybe a "randomized" instruction
ordering would perform okay, but loading the data properly is time
consuming.

At least these are the things I've observed.

  Brian

On Thu, 24 Feb 2005 11:43:23 -0500, Daniel Berlin <dberlin@xxxxxxxxxxx> wrote:
> On Thu, 2005-02-24 at 13:48 +0100, Brian Budge wrote:
> > Daniel -
> >
> > Yeah, that's what I meant... but wouldn't optimal scheduling be nice ;)
> >
> > I've been noticing this on a pentium4 (which it seemed was also what
> > Richard was using).
> >
> > It seems like SSE would be a pretty widely used target, and that's why
> > I was surprised to get slowdowns on even simple vector
> > additions/multiplies/etc... when mixed with other code. If I ran very
> > contrived examples, things ran very fast, but as soon as I put my
> > library into an application, I noticed that things were slower,
> > despite some things being calculated 4 times as fast.
> >
> > It seems that you must use the intrinsics the same way that you'd
> > write the assembly in order to get decent results.
>
> You shouldn't have to.
> The whole advantage of the intrinsics is that they are scheduled :).
>
> Anyway, looking at the scheduler descriptions, i don't see the p4
> including any sort of vector scheduling.
>
> The athlon description looks like it does.
> Try -mcpu=k8 and see if it is any better.
>
> I should note that AFAIK, Intel's compiler doesn't actually do
> scheduling for the pentium4 anymore, because it wasn't worth it. Maybe
> that doesn't apply to vector instructions (or maybe the person who told
> me this was wrong).