Dylibs vs Multi-process loading in the editor


(David LeGare) #1

This post is for some initial discussion of how we can handle loading the game code within the editor. Details are summarized and archived on the wiki.

We need to be able to run the game within the editor somehow, and there are two main approaches to doing this.

  • Compile the game as a dylib and have the editor host it, accessing the game’s state directly.
  • Run the game in a separate process from the editor and communicate the game’s state over IPC.

Requirements

  • Editor must be able to access game state.
    • This includes things like entities, components, resources, State objects, information about systems, loaded game assets, and anything else the developer might want to see about their game when running.
    • Access must be performant even for large games. If the game has 100,000 entities, we need to be able to efficiently query and display state information for all of them.
  • Hot-reloading of game code. Ideally, a developer should be able to modify some code, recompile, and then test their changes without needing to restart their game.

Dylib Loading

The biggest advantage here is that it’s relatively low overhead: the editor gains direct access to the game’s state, and therefore doesn’t need to track a separate copy of it. This is a big win in terms of memory and CPU usage.

The biggest disadvantage has to do with robustness: an error in the game code that crashes the game can also crash the editor. If the game simply panics, we can likely allow the editor to catch the panic and recover smoothly, but if the game segfaults or e.g. calls std::process::abort, it could take the entire editor down with it. For the dylib approach to be viable, we must have a solution that allows the editor process to remain running in the face of any kind of error in the game code.
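
As a rough illustration of the recoverable-panic case, here’s a minimal sketch of how the editor might guard a call into the game dylib. The GameUpdateFn symbol and the recovery steps are hypothetical; as noted above, this only covers panics, not segfaults or aborts.

```rust
use std::panic;

// Hypothetical signature for a symbol the editor loads from the game dylib
// (e.g. via the `libloading` crate); the name is illustrative only.
type GameUpdateFn = fn();

fn run_game_frame(game_update: GameUpdateFn) {
    // catch_unwind recovers from panics in the game code, but NOT from
    // segfaults or std::process::abort; those still kill the editor.
    match panic::catch_unwind(|| game_update()) {
        Ok(()) => {}
        Err(payload) => {
            eprintln!("game code panicked: {:?}", payload);
            // The editor can unload the dylib here and offer to reload.
        }
    }
}
```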

Multi-Process Architecture

This is almost the exact opposite of the dylib approach: It trades greater runtime overhead for guaranteed robustness in the face of any errors in game code.

The two biggest costs here are in latency and memory overhead:

  • There’s a delay in sending updated state data from the game to the editor. The actual latency here is likely minimal (so small that a user can’t perceive it), but we do need to rate-limit how often the data is sent, and that rate limiting results in a noticeable stutter in the editor.
  • The editor needs to maintain a copy of all the game’s state data, updating it as the game process streams changes. This is costly in terms of memory, and is likely not viable for medium-to-large games.

One potential improvement would be to not maintain a copy of the game’s state at all, and instead have the editor query game state directly over IPC. This would likely mean the game ends up keeping some internal state that’s relevant only to the editor: for example, if the user selects an entity, the editor would request that the game stream all the data for that entity and its components, and the game would need to internally track which entity is currently selected in the editor.
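
To make the query-over-IPC idea concrete, here is a minimal sketch of what the messages might look like, assuming serde for serialization. All type and field names are hypothetical, not an existing Amethyst API.

```rust
use serde::{Deserialize, Serialize};

// Hypothetical editor -> game requests.
#[derive(Serialize, Deserialize)]
enum EditorRequest {
    // Page through entity ids so a 100,000-entity world never produces
    // one giant message.
    ListEntities { offset: u64, limit: u64 },
    // Fetch every component on the entity currently selected in the editor.
    GetEntity { entity_id: u64 },
}

// Hypothetical game -> editor responses.
#[derive(Serialize, Deserialize)]
enum GameResponse {
    Entities { ids: Vec<u64>, total: u64 },
    // Pairs of (component type name, serialized component data).
    Entity { components: Vec<(String, String)> },
}
```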


(Khionu Sybiern) #2

I think worrying about resource usage with this is a little premature. Even Star Citizen currently doesn’t use more than 6 GB on my computer. So, assuming a game will stay below 4-6 GB, and using the selective sync process, the developer should be fine as long as they have at least 8 GB to dedicate to the game dev process. And that’s assuming high-end resource usage. Most indie games won’t go over 2-3 GB, so their developers could get by with 8 GB on their computer, total.

I mentioned this in a DM on Discord, but we can do batch updates: a single write batch and a single read batch per dispatch.
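
As a sketch of what that could look like per dispatch (the EditorLink type and its methods are placeholders, not an existing API):

```rust
use specs::{Dispatcher, World, WorldExt};

// Hypothetical editor connection; both methods are stubs.
struct EditorLink;
impl EditorLink {
    fn apply_write_batch(&mut self, _world: &mut World) { /* queued editor edits */ }
    fn send_read_batch(&mut self, _world: &World) { /* snapshot requested state */ }
}

fn frame(world: &mut World, dispatcher: &mut Dispatcher<'static, 'static>, editor: &mut EditorLink) {
    editor.apply_write_batch(world); // one write batch before the systems run
    dispatcher.dispatch(world);      // normal game dispatch
    editor.send_read_batch(world);   // one read batch after the systems run
    world.maintain();
}
```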

I’m very much in favour of the second option: the crash-handling aspect of the first would mean a lot of ongoing effort chasing edge cases that still crash the editor, while the latter could simply reconnect.


(Kae) #3

I am strongly in favor of a multi-process architecture. My primary reason for this is that reading directly from the game engine’s data structures requires read access to that data, which means no concurrent mutable access. It does not allow for the game loop to run concurrently with the editor. The editor is thus forced to run synchronously with the actual game engine frames to read and render its UI, or do other work.

I suggest that we learn from distributed computing architecture. It’s common to put contended data in a database and put your logic outside of the database in a stateless request handling layer. I propose a similar architecture: put a query engine in the game engine. Make it possible to query game state over RPC that is handled in the main game loop.

Amethyst is structured in a way that makes this quite viable: specs contains all World state, and it’s possible to iterate over resources and component storages dynamically since they are all known. All that’s really missing to implement, for example, an SQL query planner and executor is struct reflection for the types in the components and resources. I’m confident the performance of such a query engine would be very good.
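
The dynamic-iteration half already falls out of specs today; here is a hand-rolled sketch where each component type registers a "table" dump function. With real struct reflection these entries would be generated rather than written by hand; all names are illustrative.

```rust
use serde::Serialize;
use specs::{Component, Join, World, WorldExt};
use std::collections::HashMap;

// Serialize every (entity id, component) pair for one component type.
fn dump_storage<C: Component + Serialize>(world: &World) -> Vec<(u32, String)> {
    let entities = world.entities();
    let storage = world.read_storage::<C>();
    (&entities, &storage)
        .join()
        .map(|(e, c)| (e.id(), serde_json::to_string(c).unwrap()))
        .collect()
}

// A query executor could look each "table" up by name.
fn build_tables() -> HashMap<&'static str, fn(&World) -> Vec<(u32, String)>> {
    let mut tables: HashMap<&'static str, fn(&World) -> Vec<(u32, String)>> = HashMap::new();
    // One entry per registered component type, e.g.:
    // tables.insert("Transform", dump_storage::<Transform>);
    tables
}
```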


(Théo Degioanni) #4

This could be implemented (painfully) with “remote” and “server” features on shred and specs that would swap out the implementations of all the World machinery and redirect it to the remote server instance.
However, I have some questions about consistency: if the game grabs a join over some storages and mutates data in it, how do you send that back to the database? What if the editor races a mutation in the process?


(Kae) #5

Do you mean just exposing the shred and specs APIs over RPC? Such an implementation would not allow pushing predicates or joins into the game engine, which could drastically increase bandwidth requirements for many use-cases.

Long-lived interactive transactions can’t run concurrently with the game in the shred/specs data model. An interactive transaction would need to complete synchronously with the frame loop, meaning the loop blocks until the transaction completes.


(Khionu Sybiern) #6

Quite simply: barriers. The writes happen before absolutely everything else, and the reads happen after everything else.
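
In specs terms, that ordering could be expressed with dispatcher barriers. A rough sketch with placeholder systems (the editor systems and what they do are assumptions):

```rust
use specs::{DispatcherBuilder, System};

// Placeholder systems; real ones would drain/fill the editor's IPC queues.
struct EditorWrites;
impl<'a> System<'a> for EditorWrites {
    type SystemData = ();
    fn run(&mut self, _: ()) { /* apply queued editor edits */ }
}
struct GameLogic;
impl<'a> System<'a> for GameLogic {
    type SystemData = ();
    fn run(&mut self, _: ()) { /* the game's own systems */ }
}
struct EditorReads;
impl<'a> System<'a> for EditorReads {
    type SystemData = ();
    fn run(&mut self, _: ()) { /* snapshot state for the editor */ }
}

fn build_dispatcher() -> specs::Dispatcher<'static, 'static> {
    DispatcherBuilder::new()
        .with(EditorWrites, "editor_writes", &[])
        .with_barrier() // everything before this completes first
        .with(GameLogic, "game_logic", &[])
        .with_barrier() // reads start only after the game systems finish
        .with(EditorReads, "editor_reads", &[])
        .build()
}
```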


(Théo Degioanni) #7

Well isn’t that what you suggested? Basically having a remote World?

But what if the transaction doesn’t complete?


(Khionu Sybiern) #8

This would need to be handled as any desync. Another attempt would need to be made.


(Kae) #9

I meant literally implementing an SQL query execution backend for specs, so you can run SELECTs/UPDATEs over resources and components, with the exposed schemas based on the reflection data. For example, Transform would be a table.
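
For illustration, the kind of statements the editor might issue against such a schema, wrapped in a trivial Rust driver. The column names are made up; a real schema would come from the reflection data. This also shows the predicate pushdown from #5, since the WHERE clause runs game-side.

```rust
fn main() {
    // Hypothetical SQL over the Transform "table", keyed by entity id,
    // with one row per entity that has the component.
    let select = "SELECT entity_id, x, y, z FROM Transform WHERE y < 0.0";
    let update = "UPDATE Transform SET y = 0.0 WHERE entity_id = 42";
    // A real client would send these over RPC to the game process.
    println!("{}\n{}", select, update);
}
```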


(Khionu Sybiern) #10

I just realized an ECS makes for a relatively clean structure of data to store in an SQL DB. Aside from cases where a Component references an Entity, it’s just a bunch of one-to-many relationships (I’d say many-to-one, but I’m not sure that’s an accepted description for SQL relationships).


(Théo Degioanni) #11

No, I meant what if the game hangs while it holds the lock.
If you use a database enforcing strong locks, then that means you can potentially make the editor hang.
If you use a database with droppable transactions, you cannot possibly replay the transaction on the game’s side.

I believe a solution would be to allow the editor to locate that sort of locked data, flag it, and offer to kill the system/game that holds the lock.

This is what I had in mind, I probably did not word it properly.


(Khionu Sybiern) #12

It would be the responsibility of whoever designs the infrastructure to ensure that the process is done efficiently enough for that to not be an issue. You’re putting the cart before the horse. Any system could potentially hang the game.

Really, the editor should offer that whenever the game is held up for too long, regardless of the reason.


(Kae) #13

You don’t need any locks with the concurrency model I proposed. You can time out the transaction after a certain amount of time when commands are not being received, since blocking the game loop for extended periods is not how editor code should be written.
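
A sketch of that timeout, checked once per frame by the game’s query executor (the type and the idle budget are assumptions):

```rust
use std::time::{Duration, Instant};

// Hypothetical per-transaction bookkeeping held by the game process.
struct Transaction {
    last_command: Instant,
}

// Assumed idle budget; beyond this the frame loop aborts the transaction.
const TXN_IDLE_TIMEOUT: Duration = Duration::from_millis(100);

impl Transaction {
    // If the editor has gone quiet, give up rather than block the loop.
    fn timed_out(&self) -> bool {
        self.last_command.elapsed() > TXN_IDLE_TIMEOUT
    }
}
```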


(Théo Degioanni) #14

That’s exactly what I am talking about.

Yeah okay I guess this is the solution I thought about afterwards.


(Khionu Sybiern) #15

While this infrastructure code shouldn’t hang the system on its own, we also shouldn’t go out of our way to make the editor connection incapable of hanging the game. It could be a feature, essentially implementing a breakpoint over RPC. While developers could do this through their code editor of choice, there are cases where an artist might want to as well, e.g. to ensure animations are smooth in context.

Furthermore, I don’t like the idea of going too far to ensure the editor doesn’t lock the game. Even for regular operations with the editor, we should accept that the game is going to incur a decent, potentially noticeable, amount of overhead. I’d worry more about the integrity of the shared data in this scenario.

Actually, it would save on infrastructure if the editor were the database, for development purposes, and it would ensure the editor’s state can be relied on.


(Théo Degioanni) #16

This is the situation I was talking about earlier, and it cannot work.
But if the game is the database, there is no realistic reason for any hanging.
Just compute the database requests during the world maintain phase.
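
A sketch of that timing, with editor queries queued by the IPC thread and drained at the same point maintain() runs (the queue and the executor are hypothetical):

```rust
use specs::{World, WorldExt};
use std::collections::VecDeque;

// Hypothetical queue the IPC thread fills with editor query strings.
struct PendingQueries(VecDeque<String>);

fn end_of_frame(world: &mut World, queries: &mut PendingQueries) {
    // No systems hold borrows on the World at this point, so answering
    // queries here can't race a mutation or hang a running system.
    while let Some(query) = queries.0.pop_front() {
        // execute_query(world, &query); // hypothetical query executor
        let _ = query;
    }
    world.maintain();
}
```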


(Kae) #17

Yes, sorry if I was not clear. Editor process would not need to run specs. The game process would act as an SQL server and the editor process would act as an SQL client.


(Jacob Kiesel) #18

There’s a third approach that I believe has the merits of both.

Use the first solution, but run the user code in a separate thread from the editor. This way we don’t need to build and maintain a state-synchronization solution for our editor. If the thread crashes, we can just restart it.
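
A minimal sketch of that recovery loop; game_main is a hypothetical entry point, and note this only catches panics, so a segfault or abort in game code would still take the whole process down, per the concern in the original post.

```rust
use std::thread;

fn host_game_thread() {
    loop {
        // Run the user's game loop on its own thread.
        let handle = thread::spawn(|| {
            // game_main(); // hypothetical entry point into the game code
        });
        // join() returns Err if the thread panicked.
        match handle.join() {
            Ok(()) => break, // game exited cleanly
            Err(_) => eprintln!("game thread panicked; restarting it"),
        }
    }
}
```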


(Kae) #19

I don’t believe this solution addresses the concern I raised above (reading directly from the game’s data structures blocks concurrent mutable access, forcing the editor to run in lockstep with the game loop), which is IMO essential for fulfilling our values of scalability and performance.


(Jacob Kiesel) #20

To be honest, I don’t understand why that’s a problem. What use case are we trying to accommodate that such an architecture couldn’t handle? If anything, I believe performance would end up being worse due to all the overhead of synchronizing separate copies of memory.