[quote='[Unknown]' pid='210122' dateline='1399967675']
To be sure, there are i7s with 6 cores / 12 threads:
http://ark.intel.com/products/63696/Inte...o-3_90-GHz
Something I read indicated that the Cell PPU basically supported hyperthreading. And I will say, hyperthreading does work. For example, the uber slow softgpu in ppsspp runs much faster on my i7 with 8 threads than with 4, but worse with more than 8. Two hyperthreads aren't as good as two real cores, but they're way better than just one.
IMHO, writing assembly for certain routines is sometimes a very good idea. Some reasons:
* Hand-coded assembly does not need to conform to the parameter-passing ABI. This can have a large impact on performance in tight code (in cases where inlining might even hurt performance).
* Optimizers / compilers are sometimes stupid. This is more relevant when targeting ARM, etc.
* If generated at runtime, it can allow you to more conveniently use or skip features based on the host CPU (see the sketch below).
* It will not bloat the executable nearly as much as using a ton of templates would.
For example, the vertexdecoder jit in ppsspp gave great performance gains on both x86 and ARM, but it's not a recompiler: it just generates assembly instead of calling an array of functions, and ignores the C++ ABI.
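To illustrate the runtime-selection point above, here's a minimal sketch of picking an implementation once based on the host CPU (the decoder names are hypothetical and __builtin_cpu_supports is a GCC/Clang builtin; this is not ppsspp's actual code, which emits machine code directly):

#include <cstring>

typedef void (*DecodeFunc)(const float *src, float *dst, int count);

// Portable fallback path.
static void DecodeScalar(const float *src, float *dst, int count) {
    std::memcpy(dst, src, count * sizeof(float));
}

// Stand-in for a hand-written SSE4.1 path.
static void DecodeSSE4(const float *src, float *dst, int count) {
    std::memcpy(dst, src, count * sizeof(float));
}

// Choose once at startup based on what the host CPU supports.
static DecodeFunc ChooseDecoder() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse4.1"))
        return DecodeSSE4;
#endif
    return DecodeScalar;
}

A jit goes further: instead of choosing among prebuilt paths, it emits code specialized to the exact data format, so the per-attribute function calls disappear entirely.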
That said, it's often not a good idea unless you've tried everything else first (especially for portability reasons). A much smarter thing is to look at the assembly the compiler is actually producing, and first try to understand and resolve poor codegen. For example, MSVC iirc will not hoist accesses to a member variable out of a loop. You will often get better performance by doing this (only for hot loops):
int x = m_x; // copy the member into a local
// ... tight loop that reads and updates x ...
m_x = x; // write the result back once
than by using m_x directly inside the loop. This doesn't require writing assembly to figure out, and if you think about multiple threads you might even realize why the compiler can't safely do this optimization for you.
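As a concrete (entirely made-up) example of that pattern:

#include <cstdio>

struct Accumulator {
    int m_x = 0;

    void AddAll(const int *data, int n) {
        int x = m_x; // one load of the member
        for (int i = 0; i < n; ++i)
            x += data[i]; // the hot loop touches only the local
        m_x = x; // one store at the end
    }
};

int main() {
    const int data[] = {1, 2, 3, 4};
    Accumulator a;
    a.AddAll(data, 4);
    printf("%d\n", a.m_x); // prints 10
    return 0;
}

If AddAll read and wrote m_x directly, the compiler might reload and re-store it every iteration, since it can't always prove that nothing else observes the member in between.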
Anyway, moving things to a lower level would help in some areas for sure. Some things are very unnecessarily abstracted. The more I look at it, the less I know where to start improving things.
Bigpet: to be sure, there are multiple areas. Even with my lazy approach to mapping memory, which did help a little, overall performance did not change much, because it became dominated by the PPU interpreter (and its X vtable lookups, breakpoint checks, and thread status checks per single CPU instruction). There are definitely multiple areas that are slow right now.
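For a rough picture of where that per-instruction cost comes from, here's an illustrative sketch (all names here are made up, not the actual interpreter):

#include <cstddef>
#include <cstdio>
#include <vector>

struct State;

// One virtual call per opcode: every executed instruction pays an
// indirect vtable load on top of the per-instruction checks below.
struct Instruction {
    virtual ~Instruction() {}
    virtual void Execute(State &s) = 0;
};

struct State {
    std::size_t pc = 0;
    bool running = true;
    std::vector<Instruction *> program;

    bool HitBreakpoint() const { return false; }   // stub
    bool ThreadSuspended() const { return false; } // stub
};

struct Nop : Instruction {
    void Execute(State &) override {}
};

void RunInterpreter(State &s) {
    while (s.running && s.pc < s.program.size()) {
        if (s.HitBreakpoint()) break;   // checked every instruction
        if (s.ThreadSuspended()) break; // checked every instruction
        s.program[s.pc++]->Execute(s);  // vtable lookup every instruction
    }
}

int main() {
    Nop nop;
    State s;
    s.program.assign(1000, &nop);
    RunInterpreter(s);
    printf("executed %zu instructions\n", s.pc);
    return 0;
}

Every emulated instruction pays for the checks and the indirect call whether it needs them or not, which is why this loop dominates once memory access gets cheaper.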
-[Unknown]
[/quote]