Hi Andrew,

If you are running something on a P4, it has 20 pipeline stages. So
even in theory, if your whole program is a single instruction, it will
still take a cache miss (>10 cycles) plus 20 cycles in the pipeline.
And that is before counting other pollutants such as other threads.

The variation you see is also caused by the other threads running in
parallel, largely through their effect on the caches. Your ld/st
instruction can stall very easily if it is blocked by another ld/st
instruction (from another thread) that has a cache miss while the cache
is occupied!

-Praveen Raghavan

On 5/2/07, kernel coder <lhrkernelcoder@xxxxxxxxx> wrote:
The main idea of this project is to read packets from a gigabit
ethernet card and extract the required information from each packet in
less than 200 cycles. The algorithm is efficient enough to take less
than 200 cycles if an ideal environment is provided. An ideal
environment also means that once data is brought into the cache, it
stays in the cache until packet processing is complete.

I think the root cause of so many cycles might be user space. As this
process is running in a multitasking environment, process switching
must be taking place continuously. The timer interrupt must also be
disturbing the process. So I think that in user space the number of
cycles consumed will always be higher and inconsistent.

What do you people think? Should this project be done in kernel space
so that no other process is able to disturb it?

On 5/1/07, Andrew Haley <aph-gcc@xxxxxxxxxxxxxxxxxxx> wrote:
> kernel coder writes:
> >
> > I'm trying to write some optimized code for an AMD dual-core
> > Opteron processor, but things are getting nowhere. I've installed
> > Fedora 5 with a 2.6 series Linux kernel and 4 series GCC.
> >
> > Following are a few lines of code which are consuming close to 100
> > cycles. Yes, this is not the forum for such questions, but I think
> > people on linux kernel and GCC are best placed to answer such
> > questions. I'm really getting frustrated and helpless, that's why
> > I've put the question on this forum.
> >
> > The overhead generally varies from 360 to 395 cycles. Sometimes it
> > also drops to close to 270 cycles.
> >
> > Cycles consumed by the targeted code vary from 20 to 100 cycles.
> > Theoretically, I think the cycles consumed should be less than 20.
> > Then why so many cycles? And why does the output vary from 20 to
> > 100 cycles? Sometimes it crosses 100 cycles as well.
>
> Sure, but this is not unexpected. Think about pipelines and caches.
>
> > Sometimes the cycles consumed by the targeted code become far less
> > than the RDTSC instruction overhead.
> >
> > Is there a better way to write the above code?
>
> I'm sure there is. Jumping out of an inline asm isn't allowed at all
> in gcc, for example. We can't tell from your posting what you're
> trying to do. However, measuring time intervals on the order of 10
> nanoseconds is going to be hard, whatever you do.
>
> Tell us what code you're actually trying to measure, and we might get
> somewhere.
>
> Andrew.
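
The measurement problem discussed in this thread (a noisy cycle count
and a timer overhead that can exceed the code being timed) is commonly
handled by repeating the measurement many times and keeping the
minimum, since interrupts, context switches and cache misses can only
add cycles. The following is a minimal sketch of that idea in C, not
the original poster's code, which is not shown in the thread. The names
read_ticks, timer_overhead, measure_min and sample_snippet are
hypothetical. It assumes GCC or Clang; on x86 it uses the __rdtsc()
intrinsic from <x86intrin.h>, and elsewhere it falls back to
clock_gettime(), in which case the result is in nanoseconds rather than
cycles.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime in the fallback path */
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter. Note: RDTSC is not a serializing
 * instruction, so the CPU may reorder nearby instructions around it. */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumption: non-x86 fallback; units are nanoseconds, not cycles. */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

typedef void (*snippet_fn)(void);

/* Smallest difference seen between two back-to-back timer reads. */
uint64_t timer_overhead(int runs) {
    uint64_t best = UINT64_MAX;
    if (runs < 1) runs = 1;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_ticks();
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}

/* Smallest time seen for fn() over `runs` attempts, minus the timer
 * overhead, clamped at zero. */
uint64_t measure_min(snippet_fn fn, int runs) {
    uint64_t overhead = timer_overhead(runs);
    uint64_t best = UINT64_MAX;
    if (runs < 1) runs = 1;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_ticks();
        fn();
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best > overhead ? best - overhead : 0;
}

/* Hypothetical stand-in for the code under test. */
volatile uint32_t sink;
void sample_snippet(void) { sink += 1; }
```

Calling measure_min(sample_snippet, 100000) returns the best case seen
across the runs. The minimum is used rather than the average because
it is the sample least disturbed by other threads and interrupts. For
a snippet of only a few instructions the result can still be zero or
close to it, which fits Andrew's point that intervals on the order of
10 nanoseconds are hard to measure.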