By the way, I've just rewritten the code to target the partitions individually, and I've got an almost 4x improvement.
Shouldn't processing the trigger be faster? I would understand if there were no CPU left, but there is lots of CPU to chew.
Once you turned off hyperthreading, it was reporting 75% CPU usage. Assuming that that accounting is perfect, that means you could only get 33% faster if you were to somehow start using all of the CPU. So I don't think I'd call that a lot of CPU left. And if you have 70 processes fighting for 8 cores, I'm not surprised you can't get above that CPU usage.
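To spell out the arithmetic behind that 33% figure, here is a minimal sketch. It assumes throughput scales linearly with CPU use, which is the same "perfect accounting" assumption as above; the function name is just for illustration.

```python
def max_speedup(cpu_utilization):
    """Upper bound on speedup if the idle CPU could be put to work.

    Assumes throughput scales linearly with CPU use (an idealization).
    """
    return 1.0 / cpu_utilization


# At 75% utilization the ceiling is 1 / 0.75 = 1.33x, i.e. about 33% faster.
gain = max_speedup(0.75) - 1.0
print("{:.0%}".format(gain))  # prints 33%
```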
It seems that there will be no other way to speed this up unless the insert code is partition-aware.
There may be other ways, but that one will probably get you the most gain, especially if you use COPY or \copy. Since the main goal of partitioning is to allow your physical storage layout to conspire with your bulk operations, it is hard to see how you can get the benefits of partitioning without having your bulk loading participate in that conspiracy.
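As a rough illustration of what partition-aware bulk loading could look like, here is a sketch in Python. Everything specific in it is an assumption and not taken from your setup: the monthly child-table naming scheme, the two-column row shape, and the parent name "events" are hypothetical, and `cur` is assumed to be a psycopg2 cursor (whose `copy_expert(sql, file)` method runs a COPY). The idea is only to group rows by their target child table on the client and issue one COPY per child, so the insert trigger on the parent is never involved.

```python
import csv
import io
from collections import defaultdict
from datetime import date


def child_table(parent, day):
    """Hypothetical naming scheme: one child per month, e.g. events_2013_01."""
    return "{}_{:04d}_{:02d}".format(parent, day.year, day.month)


def batches_by_partition(parent, rows):
    """Group (day, payload) rows by child table; render each group as CSV text."""
    groups = defaultdict(list)
    for day, payload in rows:
        groups[child_table(parent, day)].append((day.isoformat(), payload))
    batches = {}
    for table, group in groups.items():
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(group)
        batches[table] = buf.getvalue()
    return batches


def copy_batches(cur, batches):
    """Issue one COPY per child table, bypassing the trigger on the parent.

    `cur` is assumed to be a psycopg2 cursor. Table names come from
    child_table() above, never from user input, so formatting them into
    the statement is acceptable for this sketch.
    """
    for table, data in sorted(batches.items()):
        cur.copy_expert(
            "COPY {} FROM STDIN WITH (FORMAT csv)".format(table),
            io.StringIO(data),
        )
```

Usage would be something like `copy_batches(conn.cursor(), batches_by_partition("events", rows))` followed by a commit. Note that a direct COPY into a child relies on that child's CHECK constraint to reject rows that were routed to the wrong table.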
Cheers,
Jeff