Guest
Unregistered
[quote='[Unknown]' pid='210122' dateline='1399967675']
To be sure, there are i7's with 6/12 cores:
http://ark.intel.com/products/63696/Inte...o-3_90-GHz
Something I read indicated that the Cell PPU supported basically hyperthreading. And I will say, hyperthreading does work. For example, the uber slow softgpu in ppsspp runs much faster on my i7 with 8 threads than with 4, but worse with more than 8. They're not as good as two real cores, but they're way better than just one.
IMHO, writing assembly for certain routines is sometimes a very good idea. Some reasons:
* Hand coded assembly does not need to conform to the parameter passing ABI. This can have a large impact on performance in tight code (in cases where inlining might even hurt performance.)
* Optimizers / compilers are sometimes stupid. This is more relevant when targeting ARM / etc.
* If generated at runtime, it can allow you to more conveniently use/not use features based on the host CPU.
* It will not bloat the executable nearly as much as using a ton of templates would.
For example, the vertexdecoder jit in ppsspp gave great performance gains on both x86 and ARM, but it's not a recompiler - it just generates assembly instead of calling an array of functions, and ignores the C++ ABI.
That said, it's often not a good idea unless you've tried everything else first (especially for portability reasons.) A much smarter first step is to look at the assembly the compiler is actually producing, and try to understand and fix poor codegen. For example, MSVC iirc will not optimize accesses to a member variable. You will often get better performance in hot loops by doing this:
const int x = m_x;
// tight loop using x
m_x = x;
than by using m_x directly inside the loop. This doesn't require writing assembly to figure out, and if you think about multiple threads you might even realize why the compiler can't safely optimize it.
Anyway, moving things more low level would help in some areas for sure. Some things are very unnecessarily abstracted. The more I look at the less I know where to start to improve things.
Bigpet: to be sure, there are multiple areas. Even in my lazy approach to mapping memory, which did help a little, performance did not change much because it became dominated by the PPU interpreter (and its X vtable lookups, breakpoint checks, and thread status checks per single CPU instruction.) There are definitely multiple areas which are slow right now.
-[Unknown]
Guest
Unregistered
[quote='[Unknown]' pid='210168' dateline='1400124827']
Well, the decode step is more basic than what you're thinking. Specifically, it involves:
1. Memory fetch (what instruction is it?)
2. Byteswap (PPC instructions are in big endian.)
3. Table lookup(s) - determine encoding of instruction and packed parameters.
4. Dispatch (call the function which can handle the instruction.)
The only step above that can practically be removed on a complicated architecture is step 2. A shadow copy of the binary could be kept pre-byteswapped, and all instructions read from there. This could be a win, depending on how much time the CPU actually spends on byteswapping. I think BSWAP is quite fast.
Anyway, the table lookup is hard to avoid. You either need to pack things in different lanes and generally deal with decoding and tables, or you need to make the instructions bigger. If they're bigger, then you spend more time on step 1. It could be tested, but I would not expect it to be a win.
You can't avoid dispatch or memory fetch without actually recompiling it. Any kind of recompiler will avoid all four of those steps, and inline the actual operation - including jit.
Using AOT instead of JIT may be relevant for the PS3. Someone mentioned to me that games were forbidden from modifying executable code, which is great news. There are some cases where AOT can be tricky, though, specifically switch jump tables in some cases... but I'm not very familiar with PowerPC or how code is generated for it by compilers usually, so that may not be an issue or it may be a larger issue than I expect.
Certainly no game binary is 10GB. The size of the disc is irrelevant for AOT, only the binary matters. It will never be larger than 256MB, since the PS3 only has 256MB of RAM. However, there's likely some means of dynamically loading binaries (like dlls), and "finding" these for AOT compilation may either be very easy or tricky.
For example, the PSP does allow self modifying code, which some PSP games use, and games can even load binary code from a datafile into memory and call it. They can also call official functions to load dynamic modules, either from data files or even from memory (which may ultimately be from a compressed or packed data file.)
So, AOT makes sense if the PS3 does not have those problems, or has them in well-defined, easily detectable ways (and if homebrew doesn't need them either.)
Anyway, rpcs3 currently spends lots of time (and memory accesses) on each of the 4 steps above. Except step 2, each one involves at least one, and probably multiple, virtual method lookups (which will be cached at least, probably, but are individual memory accesses.)
-[Unknown]
Guest
Unregistered
[quote='[Unknown]' pid='210185' dateline='1400189088']
I'm sorry, but it's pretty clear you did not understand my post. I discussed binaries, dynamic loading, and also the decoding stage.
-[Unknown]
Guest
Unregistered
[quote='[Unknown]' pid='210221' dateline='1400266536']
So, let me ask you again, if you're so convinced about these things, why aren't you trying them yourself, and building proof of concept examples? Why waste all the time with this MPD silliness?
I mean, you're providing some entertainment, don't get me wrong.
But I'm thinking you probably don't believe these things yourself (if you did, you'd be researching them and trying to gain knowledge and test them), and just are hoping to send someone on a wild goose chase. That's a shame.
-[Unknown]
Guest
Unregistered
[quote='[Unknown]' pid='210304' dateline='1400390042']
Well, that is just AOT (ahead of time) instead of JIT (just in time.) I thought I read that Short Waves was doing AOT, I might be wrong.
As long as self-modifying code is 100% forbidden, AOT is probably a good idea. But either JIT or AOT will get rid of the decode stage. An intermediate representation can be created to simplify creating multiple native versions, but there's no point interpreting that IL. No matter what it should end up executing as native code for the CPU to decode in hardware.
-[Unknown]
Ontakeio
Unregistered
TL;DR: you can skip the first part below if you just want to see my ideas:
I would like to first say that I have been a long-time observer of this emulator, have dug into the source a lot, and have noticed many issues that could be fixed. Speed also seems to be a problem for the future, since single-die chips will not offer the power needed. I have followed these forums for many months and have seen progress, but I think some problems could be sorted out by creating separate builds and a co-design plan to cater to higher-end machines that could handle the parallelism the PS3 needs.
Here are some of my ideas (more to be added) on how RPCS3 could benefit:
1. JIT or JITIL emulation:
JIT compiles the PowerPC code into x86 code (or whatever target is wanted), copies it into a cache, then executes it. Once cached, it runs faster since the translation cost is paid only once. JITIL is more experimental but can work better, since it does basically the same thing while compiling the PS3's machine code to an intermediate language before native execution, and can cache some of that as well. For some FPU instructions on the PS3, routing code through an IL could benefit some game code. At the least, it should be considered, since Dolphin uses this approach and it works wonders for some games, giving a real performance boost.
2. Supercomputer optimization:
Though not always practical, optimizing RSX and Cell PPU parallelism behind a specific option for supercomputing platforms could greatly benefit performance. I am not talking about ten-thousand-dollar supercomputers, but a lighter setup that can take advantage of MPI (message passing interface, a system designed for parallelism, typically in supercomputers), and definitely multi-GPU and GPGPU techniques (get a build set up to work with Mantle and some low-level shaders, implement some GPGPU to offload the main CPU, and use one GPU for drawing to the screen while another handles computations). A Linux release could benefit from Open MPI, as could Mac/Unix, etc. This "supercomputer" could be an array of several i7s connected through MPI plus multiple GPUs ($2000-$3000), with all tasks split up and processed by many machines at once, making the desired output faster. This may not seem worthwhile, but it is not a total waste and could be promising if done right. In the long run any PS3 game should run perfectly this way, and making it an option in rpcs3 alongside another execution core might suffice.
3. Take advantage of lower-level code:
Some parts of the code can be optimized much better in assembly for the desired platforms. By doing so you reduce C/C++ overhead, even if you think your compiler can beat hand-written assembly all the time. Doing this for one subroutine is hardly worthwhile, but if you optimize many parts of execution (subroutines, loops, iterations, etc.) with lower-level, faster algorithms, the whole program benefits. Some high-level C++ code can be better written in target assembly to take advantage of features C++ cannot offer. It also partly avoids the C++ runtime: assembly can get the same amount of work done with fewer opcodes, saving (sometimes) millions of clock cycles, depending on the optimization.
If this thread is still open, I will come back and add a few more ideas, but right now I have to go somewhere and do something.
I don't feel like I can comment on this since I don't have that deep an understanding of how rpcs3 works, so instead I am going to ask: can you do this? Feel free to
Bigpet
Unregistered
I appreciate the sentiment, but I don't think you are correctly assessing the situation. We currently spend most of the time in our inefficient memory system. We aren't really strapped for "ideas"; what we're lacking is manpower and time. But again, I appreciate the will to help.
Darkriot
Member
0
498 posts
29 threads
Joined: Aug 2017
Umm... my English is bad, and I know only a little about Rpcs3, but if I understand this text, this guy wants to help Rpcs3?
(Sorry for the terrible English)
logan
Unregistered
great ideas, but you should start on github.
Start writing some code