Re: AMD dual core Opteron optimization

"Praveen Raghavan" <praveenr@xxxxxxxxx> · Wed, 2 May 2007 09:14:38 +0200

Hi Andrew,

If you are running something on a P4, it has 20 pipeline stages.
So even in theory, if your whole program is a single instruction, it
will take a cache miss (>10 cycles) + 20 cycles in the pipeline.
That is before counting other pollutants such as other threads, etc.
The variation you see is also caused by the other threads running in
parallel, largely through their effect on the caches. Your ld/st
instruction can easily get stalled if it is blocked by another ld/st
instruction (from another thread) that has a cache miss while the cache
is occupied!
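
One way to see past this kind of noise is to time the same code many
times and keep the smallest reading, since cache misses and interference
from other threads can only add cycles. A minimal sketch, assuming GCC or
Clang on x86 and the `__rdtsc()` intrinsic from `<x86intrin.h>`; `work()`
is a hypothetical stand-in for the code being measured, and `min_cycles()`
and `net_cycles()` are names made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(); x86 / x86-64 with GCC or Clang only */

/* Hypothetical stand-in for the code under test (an assumption). */
volatile uint32_t sink;
void work(void)  { sink = sink * 33u + 1u; }

/* Empty function, used to estimate the cost of the timing itself. */
void empty(void) { }

/* Smallest TSC difference observed over `runs` timings of fn(). */
uint64_t min_cycles(void (*fn)(void), size_t runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < runs; i++) {
        uint64_t start = __rdtsc();
        fn();
        uint64_t stop = __rdtsc();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}

/* Estimated cost of fn() with the timing overhead subtracted.
   Clamps at zero in case fn() measures below the overhead. */
uint64_t net_cycles(void (*fn)(void), size_t runs)
{
    uint64_t overhead = min_cycles(empty, runs);
    uint64_t total    = min_cycles(fn, runs);
    return total > overhead ? total - overhead : 0;
}
```

RDTSC is not a serializing instruction, so the CPU may reorder it
relative to nearby instructions, and readings of a few tens of cycles
stay approximate. The clamp in `net_cycles()` covers the case where the
measured code appears cheaper than the timing overhead itself.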

-Praveen Raghavan

On 5/2/07, kernel coder <lhrkernelcoder@xxxxxxxxx> wrote:
The main idea of this project is to read packets from a gigabit
Ethernet card and extract the required information from each packet in
less than 200 cycles. The algorithm is efficient enough to take less
than 200 cycles if an ideal environment is provided. An ideal
environment also means that once data is brought into the cache, it
remains in the cache until packet processing is complete.

I think the root cause of so many cycles might be user space. As this
process is running in a multitasking environment, there must be
continuous process switching taking place. The timer interrupt must
also be disturbing the process. So I think that in user space the
number of cycles consumed will always be higher and inconsistent. What
do you people think?

Should this project be done in kernel space so that no other process
is able to disturb it?
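
Short of moving into the kernel, one user-space step that may reduce the
disturbance is pinning the measuring thread to a single CPU, so the
scheduler cannot migrate it between cores. A minimal sketch, assuming
Linux with glibc; `first_allowed_cpu()` and `pin_to_cpu()` are
hypothetical helper names, not part of any library:

```c
#define _GNU_SOURCE      /* needed for cpu_set_t macros and sched_*affinity() */
#include <assert.h>
#include <sched.h>

/* First CPU the calling thread may currently run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Restrict the calling thread to one CPU.
   Returns 0 on success, -1 on error (errno is set by the kernel). */
int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof one, &one);
}
```

Pinning only prevents migration. Interrupts and other runnable tasks can
still preempt the thread on that CPU, so some variation in the cycle
counts should still be expected.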

On 5/1/07, Andrew Haley <aph-gcc@xxxxxxxxxxxxxxxxxxx> wrote:
> kernel coder writes:
>
>  > I'm trying to write some optimized code for an AMD dual core
>  > Opteron processor. But things are getting nowhere. I've installed
>  > Fedora 5 with a 2.6 series Linux kernel and 4 series GCC.
>  >
>  > Following are a few lines of code which are consuming close to 100
>  > cycles. Yes, this is not the forum for such questions, but I think people
>  > on Linux kernel and GCC are best placed to answer this type of question. I'm
>  > really getting frustrated and helpless, that's why I've put the question on
>  > this forum.
>
>  > The overhead generally varies from 360 to 395 cycles. Sometimes it
>  > also drops close to 270 cycles.
>
>  > Cycles consumed by the targeted code vary from 20 to 100
>  > cycles. Theoretically I think the cycles consumed should be less than
>  > 20. Then why so many cycles? And the output varies from 20 to 100
>  > cycles. Sometimes it crosses 100 cycles as well.
>
> Sure, but this is not unexpected.  Think about pipelines and caches.
>
>  > Sometimes the cycles consumed by the targeted code become far less than
>  > the RDTSC instruction overhead.
>  >
>  > Is there a better way to write the above code?
>
> I'm sure there is.  Jumping out of an inline asm isn't allowed at all
> in gcc, for example.  We can't tell from your posting what you're
> trying to do.  However, measuring time intervals on the order of 10
> nanoseconds is going to be hard, whatever you do.
>
> Tell us what code you're actually trying to measure, and we might get
> somewhere.
>
> Andrew.
>