Intrigued by Star Citizen's server meshing, replication layer and persistence tech, I wanted to tinker around with the idea myself. In this project I attempt to build a world in Unreal Engine that is simulated by multiple servers, without using built-in multiplayer.
Project Introduction
Oh boy. Where should I start with this one? It is by far my favorite project I have ever worked on. Let me explain...
It all started with my favorite game, Star Citizen. This project and the people behind it are pushing the limits of what is possible, which is something I've always enjoyed in programming (and with my own skills).
One of the major tech features that they are working on is server meshing. In simple terms, this tech divides the simulation of the game world over several game servers. If that's still a little too techy for you, let me put it this way:
Imagine you are playing Minecraft on a server with your friends. Every block you mine, every step you take is simulated by the game server that you connect to, and the results of the simulation are distributed to everyone online (including you).
I am sure that you've run into situations where the server would get laggy, either because you're blowing up too much TNT or because there's just a ton of people online, spread out over the game world.
With server meshing, many of the issues caused by a single server running out of compute power will hopefully be solved. But this is, of course, easier said than done.
Now, I don't know much about the actual server meshing implementation that Cloud Imperium Games is working on, beyond the limited public knowledge floating around. But what we do know about the tech seems very interesting to me, which is why I wanted to mess around a bit with the idea myself.
Server Meshing in Unreal Engine
Unreal Engine has one of the easiest-to-use built-in multiplayer stacks I've come across. I used it to make a small prototype of a project I was working on, and without too many issues I was able to get some nice basic multiplayer going, with a world on a server and players that could walk around and drive vehicles.
For most people, this would be all they need. But with the concept of server meshing in mind, and seeing as Unreal's under-the-hood networking only supports connecting to a single dedicated game server, I decided to create a multiplayer system from scratch without using any of that built-in logic.
That is what this project is all about: seeing how far we can get using Unreal Engine, my relatively new C++ experience, and a whole lot of motivation to build something most people wouldn't even bother to try because of the massive scope.
Dawn Of The First Prototype
Check out the video below for the result of this prototype. A little explanation for context: the blue capsule is the last position we received from the server for this entity, while the red one shows the last server position that we locally processed.

As for the debug text in the top left of the screen: you can ignore most of it, but take a look at the 'authoritative server id' as we walk into the room. If you look closely, its ID will match the one on the label for that zone in the world.

Also, yes, the blue cubes you see are synced as well, and they were a big help in getting the sync down. Sadly, I don't have any videos lying around of that part of the process.
Alright, so for the first prototype I started off by establishing exactly what I would need in order to make the server mesh work. If I had to define the global idea in a single sentence, it would be something along the lines of: "The ability to simulate parts of the game world on separate servers, all the while providing an experience so seamless that players would not even know that there are any other servers involved."

Another thing was on my mind. While we could technically divide the game world up into separate zones during development and hardcode servers to simulate specific coordinate-based zones, I felt this would heavily restrict what I could do with the map (and this system) in the future.

That's why I came up with a solution (that may or may not be based on Star Citizen's dynamic server meshing...): streaming containers. If we look at the world as a hierarchical tree, we can define each node that has children as a 'streaming container'. The root of the tree is the 'world origin' or 'world parent'; everything we spawn inside the world becomes a child node connected to the world origin node.

Now, if we program the system in such a way that a server can receive responsibility over a node within the world hierarchy, all child nodes under that node are then simulated by said server. And with a small tweak, we can also make a server's responsibility stop whenever it encounters a node that is already simulated by another server (a small sketch of this idea follows the component list below).

In a very backwards fashion, that is the first thing I started to develop: the concept of these streaming containers and their entities, all sharing a base class containing the bare necessities of an entity that could exist in our world (and be used in this system). Because I decided in advance that I wanted to split the codebase up into several parts based on their responsibilities, I wrote the shared code (containing these base classes) in a 'vanilla' C++ static library project, vanilla meaning that it did not use any datatypes provided by Unreal Engine.

Let's fast forward just a little bit, because if I went through my entire detailed thought process we'd still be here ten pages later. Once I got the prototype where I needed it to be to perform some basic tests, I had created the following:
A C++ multiplayer framework built from scratch, set up in such a way that it can be compiled on both UNIX and Windows (roughly based on the design of StyloNetCore, my C# multiplayer library that is used in Stumble Upon Rumble)
An Unreal Engine project that housed both the client and server logic, but could be built into separate client and server builds (excluding any server data from the client)
A pure C++ console application utilizing my C++ multiplayer framework, functioning as the glue between the game clients and all the game servers (this would probably be the equivalent of Star Citizen's replication layer)
A shared static library containing data shared between the replication server, dedicated game servers and game client
A Dockerized neo4j database to store all the persistent world data in, plus a custom build of libneo4j-omni for specific operating systems (this is used by the replication server to load data from the persistent database into memory and to send commands to the neo4j database)
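Here's the promised sketch of the streaming container idea. To be clear, this is a hypothetical illustration (the `Node` struct and `CollectSimulatedEntities` are made-up names), not the actual shared-library code:

```cpp
#include <string>
#include <vector>

// Hypothetical node in the world hierarchy. Any node with children can act
// as a 'streaming container'; leaves are plain entities.
struct Node
{
    std::string EntityId;
    int SimulatingServerId = -1; // -1 means: inherit authority from the parent
    std::vector<Node*> Children;
};

// Gather everything a given server is responsible for, starting from the
// container it was assigned. Recursion stops at any node that is explicitly
// simulated by a different server, so nested containers can be handed off.
void CollectSimulatedEntities(const Node& Container, int ServerId,
                              std::vector<const Node*>& OutEntities)
{
    for (const Node* Child : Container.Children)
    {
        if (Child->SimulatingServerId != -1 && Child->SimulatingServerId != ServerId)
            continue; // another server's subtree; authority stops here

        OutEntities.push_back(Child);
        CollectSimulatedEntities(*Child, ServerId, OutEntities);
    }
}
```

The nice property here is that handing a nested container to another server is just a matter of setting its `SimulatingServerId`; the parent server's traversal then skips that subtree automatically.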
The game was very limited in terms of features. My goal was to write a player controller as soon as I had all the other networking work taken care of (connecting to servers, syncing data and so on). With that player controller, we were finally able to move around the game world and test the server mesh.

And to my surprise... it worked? Granted, the sync was very primitive, using fixed timesteps to do limited movement, because integrating Unreal's physics system with the custom fixed timestep was a little out of scope for what I was trying to do here (a generic sketch of such a fixed-timestep loop follows the summary below). But authority was transferred successfully and seamlessly as I moved between areas of authority.

Words cannot describe how crazy it was to see this work, even going strong when hosting the replication server and neo4j database on an AWS EC2 instance to better test out latency (and multiple players), while the game servers connected to the replication server were running from my home (adding even more latency).

Very abstract, I know, but this is the debug data I was observing while quickly walking between the zones of authority. That GUID is the entity ID I was controlling at the time. Left is server A, right is server B.

Overjoyed as I was... I knew this wasn't going to cut it. Thinking ahead, even implementing the easiest next step, such as the player jumping or properly falling using physics, wasn't going to work with a fixed timestep that desperately tried to avoid the physics system. But we had a result. And that result was that yes, the design I made here actually has a chance of working! All I had to do now was find a better way to sync entities and their physics over the network (to get the basics working, anyway) while playing a little more nicely with Unreal's physics engine.

I think I spent somewhere in the range of two to three weeks planning this out, preparing dependencies and writing all the code. A little summary of what we have now, after finishing this prototype:
Multiplayer game setup that has a game client, replication server and dedicated game servers
Master game state stored on the replication server, distributed to servers and clients alike from there
Neo4j database that stores persistent world data (entities, pretty much)
Game server crash recovery: thanks to the replication server, a fresh dedicated game server can pick up where the old one crashed, once the relevant gamestate data is fed into it
Semi-dynamic server mesh, in the sense that we can spin up servers at will and give them control over any streaming container in the world
Player that can walk between zones of authority without noticing it themselves
And finally.... even more motivation! I can't wait to continue prototyping the whole idea.
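Before moving on to the second prototype, here is the promised sketch of a fixed-timestep loop. This is the generic, textbook version of the pattern with made-up function names, not the prototype's actual code:

```cpp
#include <chrono>
#include <cstdint>

// Placeholders for the actual game logic.
void SimulateTick(uint64_t Tick, double FixedDelta) { /* advance the world one step */ }
void Render(double Alpha)                           { /* draw, blending by Alpha */ }

int main()
{
    using Clock = std::chrono::steady_clock;

    constexpr double FixedDelta = 1.0 / 50.0; // 50 simulation ticks per second
    double Accumulator = 0.0;
    uint64_t Tick = 0;

    auto Previous = Clock::now();
    while (true)
    {
        const auto Now = Clock::now();
        Accumulator += std::chrono::duration<double>(Now - Previous).count();
        Previous = Now;

        // Consume real elapsed time in fixed-size steps, so the simulation
        // advances deterministically no matter the frame rate.
        while (Accumulator >= FixedDelta)
        {
            SimulateTick(Tick++, FixedDelta);
            Accumulator -= FixedDelta;
        }

        // The leftover time tells us how far we are between two ticks.
        Render(Accumulator / FixedDelta);
    }
}
```

Because every tick is the same size, 'tick N' means the same thing on every machine, which is exactly what makes this attractive for sync, and exactly what clashes with a variable-rate physics engine like Unreal's.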
Rise Of The Second Prototype - The Meshening
Check out the video below for the result of this prototype. Apologies in advance for the annoying camera movements; the pitch was still inverted in this build, and alt-tabbing to terminate/start the servers would move the camera to a suboptimal perspective.

A little context: the world was divided into two zones, basically the inside of the cube outline and everything outside of it. There were two dedicated game servers running on AWS, one simulating the inside of the cube outline and one simulating the outer area. The replication server ran on a separate EC2 instance, and that is the server we connect to with game clients (there's even a firewall in place between us and the game servers, so we're never directly connected to those).

The debug info in the top left of the screen shows our current authority status and the server's FPS (basically the number of update ticks it manages to do per second); the target FPS is set to 50 on the server. It also shows the current tick, but that is irrelevant for the test in the video.

If you check 0:38 in the video, you can see the authority transfer in action. I realize it is not as smooth as you'd want in a final product, but for the prototype this feels like a win.

To see crash recovery in action, check 0:46. If you keep an eye on the authority text in the top left, you'll see the moment the server comes online and our authority and simulation are picked up by a fresh DGS. Since the world and the amount of data are so small, recovery is pretty much instant from the moment the server is turned on (as you can see, it had already started before I was able to alt-tab back in).
First things first

For this prototype I decided to take a more calculated approach in terms of planning. While the previous prototype was more like 'build all of these items as fast as acceptably possible' to have a testable project, I started the second prototype off with the essentials. After all, we know that the general direction we went for in the first one worked.

The first item on the list was the `SocketCommunication` class. In summary, this class houses all of the code and routines needed to establish, maintain and use connections to endpoints. Because we love reusability in this line of work, the class is set up in such a way that it can be used on all three targets (client, replication server and dedicated game server). It covers connection initiation, watching for timeouts, and processing packets into usable object instances that can be consumed by the different game behaviours.

Project structure

With the first prototype I relied on separate codebases and applications to each do their part in the whole process: a separate replication server (in plain C++ with the custom multiplayer library), and the client/DGS both built from the Unreal project. To make iteration (and sharing code) even more efficient, I decided to build all three targets from the Unreal Engine project. Through the use of modules, the code for client and servers is neatly separated. Not only did this feel cleaner, it also prevents the server code from (accidentally) being referenced or compiled in client builds.

A nice little extra is that the replication server now has easier access to some game world properties. So now, even when only the replication server is up, player/vehicle spawn requests can be queued up and processed as soon as a DGS comes online.

Building the system - initial goal

With this out of the way, I set the next goal for the process: a better version of a controllable player synced over the network. To get there, I limited the client-server communication to just a single dedicated game server, which allowed for much quicker iteration and verification of what was built.

Instead of implementing fixed timestep simulation like the first prototype, I picked the state interpolation method. Ideally I'd stick to the fixed timestep simulation, but for now, especially since I really want to use UE5's physics system (and all the components that use it, like Chaos Vehicles and the player movement), this was the way to go. In addition to that, I ensured that the server simulation was actually running ahead of the game client. Now I can hear you think: "Well, duh. Didn't you do that in the previous prototype?" Valid question. I actually didn't, which may have contributed to some of the issues I experienced along the way.

Working entity sync

After some tinkering and many cups of coffee, I ended up with a version that met the initial goal I set. Because the simple sync was so similar between players and vehicles, I bit the bullet and implemented both. Using this version, I was able to verify that the sync was acceptable the way I built it.

In short, we capture and activate user input each 'fixed tick'. This fixed tick is synced between clients and servers. In the usual 'tick' (UE's actor update loop, basically), we apply the input to the respective UE components. Also happening each fixed tick is the client sending the input information to the server. The server then follows the same pattern of applying the received input in the correct fixed tick, and it distributes all simulated states to clients.

When clients receive an authoritative state from the server, they add it to a circular buffer indexed by server tick. This way, when the client reaches the tick of that state, it can apply it as the active state. Clients then interpolate towards the active state, based on an alpha calculated from the discrepancy between the local and authoritative state, and the age of the state.
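To make that a bit more tangible, here is a minimal sketch of such a tick-indexed circular buffer. The names and the position-only state are illustrative; the real buffer would hold much richer state per entity:

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Simplified authoritative snapshot for one entity at one server tick.
struct EntityState
{
    uint64_t ServerTick = 0;
    float Position[3] = {0.f, 0.f, 0.f};
};

// Circular buffer indexed by server tick; newer ticks overwrite older ones
// that land on the same slot.
class StateBuffer
{
public:
    static constexpr size_t Capacity = 64;

    void Add(const EntityState& State)
    {
        Slots[State.ServerTick % Capacity] = State;
    }

    // Returns the state for Tick if it is still in the buffer. The tick
    // check guards against reading a slot that was already overwritten.
    std::optional<EntityState> Get(uint64_t Tick) const
    {
        const auto& Slot = Slots[Tick % Capacity];
        if (Slot && Slot->ServerTick == Tick)
            return Slot;
        return std::nullopt;
    }

private:
    std::array<std::optional<EntityState>, Capacity> Slots{};
};
```

When the client's local tick reaches a stored `ServerTick`, that snapshot becomes the active state to interpolate towards.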
Splitting of duties - replication server

Now that we've verified that the sync works, we can start splitting the simulation logic and the replication logic. If anything we do now messes up the experience, we'll know that it is not so much the sync, but rather the way we've split replication and simulation. Which should, in theory, make it easier to troubleshoot and iterate on the correct systems.

What I've done to make this split as easy as possible is store entity replication data in a separate struct. This means less explicit copying/updating all around, and it gives us an easy overview of exactly what is replicated for each entity.
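As a rough illustration (made-up fields and names, not the project's actual struct), such a replication struct could look something like this:

```cpp
#include <cstdint>

// Hypothetical example of keeping everything that gets replicated for an
// entity in one place. Simulation-only data (physics handles, component
// pointers, and so on) deliberately lives outside of this struct.
struct EntityReplicationData
{
    uint64_t EntityId = 0;
    uint64_t ServerTick = 0;   // tick this snapshot belongs to
    float    Position[3] = {}; // world-space position
    float    Rotation[4] = {}; // orientation as a quaternion
    float    Velocity[3] = {}; // linear velocity, useful for interpolation
};
```

Serializing a snapshot then just means copying this one struct into a packet, instead of cherry-picking fields from all over the entity.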
Though honestly, a large part of this process was cutting and pasting some specific logic, like keeping track of connected clients and sending entity replication data to clients, into the replication server behaviour. Because now that we have the split, dedicated game servers only communicate with the replication server and have no real idea of which clients are connected.

New code was mainly the communication of replication data between DGS and replication server, and defining authority. Normally, a server should not need to send replication data for an entity it doesn't have authority over. The replication server regulates authority based on the streaming container an entity is in.

Streaming containers are just what you'd expect from the name: virtual cubes in the world that define a space (and they can be nested, too). A DGS gets assigned a root streaming container after it connects (or after a new streaming container that needs a server becomes available). The authority for that server then begins at the root streaming container and applies to all entities within it. Though, if there is a nested streaming container that has its own DGS, the authority stops there. This way, we can even have servers simulate zones that are located inside other servers' authority zones.

With all of this up and running, the replication server collects the entity states from the authoritative servers and distributes the new states every fixed tick. I should say that the states are distributed to all relevant clients, but also to all relevant DGSes (with the exception of the authoritative server, because it already has the latest state).

Like I mentioned earlier, the replication server also handles authority management for entities and servers. If a streaming container goes from being just a plain container to a container that needs its own server, a standby DGS is assigned to it. In the next authority evaluation tick, any entities in that container will start an authority transfer.

Authority transfer

For the authority transfer I built a very simple transition system that felt appropriate for the current project. The entity's authority state goes from 'established' to 'transferring', and a transition window begins. During the next X ticks, crucial data like input and the latest state are directly forwarded to the new server to get it in sync with the current authoritative server. After the transfer window closes, the entity's authority state switches back to 'established', but now with the new authoritative server. This means that the old server will no longer send state updates (and if it does, they are rejected), and things like input are now only sent to the new server.

Crash recovery

This system allows for a very cool feature. Because authority is constantly being evaluated, imagine a nested server going down. Upon the next authority evaluation, the replication server will see that the established authoritative server is no longer the correct one. It then checks whether the established authoritative server is still online. If it is not, it will immediately switch over to the correct server instead of transitioning. Because, you know, there isn't much to transition from if the server is gone.

While this may cause minor hiccups, it prevents the gamestate from being completely lost. The only way for the gamestate to be lost (and for players to be disconnected) would be for the replication server to go down. So it is crucial that it is built as robustly as possible.

Now I hear you ask: what if all dedicated game servers go down? Well, here's the cool part. As long as the replication server is still online, clients remain connected and the gamestate remains saved. Sure, you won't be able to interact with anything and you won't see anyone move, but the state is intact. As soon as a DGS connects to the replication server, the current state is seeded to it. Once that is completed, the authority evaluation I talked about earlier kicks in and immediately transfers authority to that server, resuming simulation from the last known gamestate. If players have moved at all (or if other entities in their game have moved from their last position), they will experience a short snap back to the correct position and state. But after that, gameplay continues as normal.

The end result

The second prototype was a blast to work on. We have controllable cars and characters (white capsules, but still, you can walk and jump), distributed simulation load, a replication server keeping track of authority and gamestate, and crash recovery. While this prototype doesn't store the actual gamestate in any persistent database yet, it proves that the current design is very usable and heading in the right direction. I have actually already started implementing this system in an undisclosed project that I started working on last year, so perhaps I'll be able to share advancements related to this in due time.