A Novel Design Approach for Online Game Server Engines
Is there a game server architecture that is distributed yet as simple as writing single-threaded programs, where data is automatically pushed, and unnecessary APIs are hidden, allowing focus solely on logic?
Genesis
Game servers were initially developed in C, which meant slow development cycles and painful debugging. Later we adopted LuaJIT; we were quite aggressive, switching as early as 2009. Online games are supposed to prioritize performance, but in practice the ease of development and flexibility of a dynamic language matter more than raw performance.
During continuous feature improvements and refactoring, we realized the code workload was still heavy and couldn’t keep up with increasingly fast iteration demands. Many features couldn’t be experimented with and tested immediately. To improve, we needed to change languages again and redesign the game server from a conceptual level.
Server as Database
Traditional architectures have the game server receive client instructions, process the logic, manipulate the database, and finally return the updated data to the client. This process itself introduces complexity. So why not simply hide the database, or rather, merge the two? The game server itself becomes the database.
This idea is quite natural. Online games have always disliked relational databases, opting for simple memory mapping for efficiency and speed, or NoSQL solutions like Redis. Furthermore, the best database is always a custom one, as different business needs require different trade-offs. So, why not design the server as a specialized database interface serving game clients? Of course, the backend would still rely on a real database for persistence, but database operations are completely hidden and handled internally automatically.
Hereafter, the game server will be referred to as db, and the actual backend database as backend.
- First, game clients can directly subscribe to queries (select ... where ...) on the db. Any data changes within the subscription will be automatically pushed by the db, including events like on_insert, on_update, on_delete, etc.
- Tables should have simple permission settings, allowing clients to only query data relevant to them. However, some tables should be set to guest permissions, such as server online status, allowing even unlogged-in clients to query.
- Clients can only subscribe, they have no write permissions. Writes must go through the db logic code, which is the traditional game server logic. Since it’s a db, you could also call it stored procedures. Clients directly call by name, and the db executes the corresponding function. In most cases, there’s no need to return execution results because they will be automatically reflected via pushes.
- Read/write operations within the db logic code must be wrapped in a transaction, then automatically committed to the backend. If a transaction conflict occurs, it is automatically retried. This way, developers don’t need to consider race conditions during writing, and deployment becomes more flexible.
graph TD;
subgraph "Game Server (DB)"
DB1["Subscription"];
DB2["Logic"];
end
Client["Game Client"]<--Data Stream (Sub)-->DB1;
Client--Call-->DB2;
subgraph "Backend"
DB1<--Read-Only Subscription-->Replica;
DB2--Read-Write Transaction-->Master;
end
This form makes development very comfortable for both client and server sides. Clients only need to subscribe to data, and UI or objects in the scene can be automatically handled through data events, largely decoupling from server logic. The server side can focus solely on writing the actual logic, no longer needing to worry about sending any updated data to the game client.
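The subscription model described above can be sketched with a toy in-memory table. This is only an illustration, not the engine’s API: the Table class, its subscribe/upsert/delete methods, and the callback signature are assumptions made for this example; only the event names on_insert, on_update, and on_delete come from the text.

```python
from __future__ import annotations

from typing import Any, Callable

Row = dict[str, Any]
Callback = Callable[[str, Row], None]  # receives (event_name, row)


class Table:
    """Hypothetical in-memory stand-in for one table of the db."""

    def __init__(self) -> None:
        self.rows: dict[int, Row] = {}
        self.subs: list[tuple[Callable[[Row], bool], Callback]] = []

    def subscribe(self, where: Callable[[Row], bool], callback: Callback) -> None:
        # Register the query, then deliver the rows that already match.
        self.subs.append((where, callback))
        for row in self.rows.values():
            if where(row):
                callback("on_insert", dict(row))

    def _notify(self, old: Row | None, new: Row | None) -> None:
        # Compare membership in each subscription before and after the write.
        for where, callback in self.subs:
            was = old is not None and where(old)
            now = new is not None and where(new)
            if was and now:
                callback("on_update", dict(new))
            elif now:
                callback("on_insert", dict(new))
            elif was:
                callback("on_delete", dict(old))

    def upsert(self, row_id: int, **fields: Any) -> None:
        old = self.rows.get(row_id)
        new = {**(old or {"id": row_id}), **fields}
        self.rows[row_id] = new
        self._notify(old, new)

    def delete(self, row_id: int) -> None:
        old = self.rows.pop(row_id, None)
        if old is not None:
            self._notify(old, None)


# A client subscribes to "select * from hp where owner = 1".
events: list[tuple[str, int]] = []
hp = Table()
hp.upsert(1, owner=1, value=100)
hp.subscribe(lambda r: r["owner"] == 1,
             lambda ev, row: events.append((ev, row["value"])))
hp.upsert(1, value=80)            # pushed as on_update
hp.upsert(2, owner=2, value=50)   # not pushed: outside the subscription
hp.delete(1)                      # pushed as on_delete
```

The server logic only writes rows; which clients hear about a change is decided entirely by the subscriptions, which is the decoupling the paragraph above describes.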
Feature Trade-offs
This primarily involves balancing a variant of the impossible triangle: performance, flexibility, and data consistency. Improving one aspect tends to lower another.
First is performance. A load-balanced distributed structure similar to web servers can be adopted, so performance is constrained only by the backend database. A high-performance backend like Redis, which also has built-in message queue capabilities, eliminates the need for an extra layer. Redis can scale horizontally to improve performance, but this involves trade-offs with data consistency. Redis is also inconvenient for data maintenance and presentation. Usually, Redis is used only as a cache, with another layer like MySQL added for persistence, but this adds significant complexity and greatly reduces flexibility. Since Redis is capable of persistence and can implement indexing, why not use Redis alone? I plan to address data maintenance and presentation later using data science libraries.
Flexibility includes aspects of failure and maintenance. Here, high availability is sacrificed. When errors occur, clients are directly disconnected and connect to another server. Games can implement seamless reconnection or simply require re-login. For maintenance, the best approach is one-click server deployment, facilitating dynamic scaling, which can significantly reduce costs. Automatic failure recovery can be achieved by setting up auto-restart, which requires good data consistency so that no data rollback/repair is needed during failures.
Finally, games have high requirements for data consistency. First, how do we resolve data races in a distributed system? Second, how do we handle half-written data when a program crashes or a server goes down? Situations like deducting a player’s money but not delivering the item are unacceptable and must be avoided. The best method is to rely on backend transactions. However, transactions significantly reduce backend performance; even for Redis, they can reduce performance by two orders of magnitude. Without transactions, because indexes are involved, locks or other methods would be required. These are very troublesome to write and would also sacrifice all maintenance flexibility. Therefore, transactions are chosen here, sacrificing backend performance. The backend can then still scale horizontally while maintaining consistency by splitting transactions, which will be discussed later.
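The "wrap in a transaction, retry on conflict" behaviour can be sketched with optimistic concurrency over an in-memory store. Everything here (Store, Txn, Conflict, run_in_transaction, the key names) is hypothetical and stands in for the real backend; on Redis the same idea is usually expressed with WATCH/MULTI/EXEC, but whether the engine does it that way is an assumption.

```python
class Conflict(Exception):
    """Raised when a key read by the transaction changed before commit."""


class Store:
    """Hypothetical backend stand-in; every key carries a version counter."""

    def __init__(self):
        self.data = {}     # key -> value
        self.version = {}  # key -> int

    def read(self, key):
        return self.data.get(key, 0), self.version.get(key, 0)

    def commit(self, seen, writes):
        # Verify nothing we read has changed, then apply all writes together.
        for key, ver in seen.items():
            if self.version.get(key, 0) != ver:
                raise Conflict(key)
        for key, value in writes.items():
            self.data[key] = value
            self.version[key] = self.version.get(key, 0) + 1


class Txn:
    def __init__(self, store):
        self.store, self.seen, self.writes = store, {}, {}

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        value, ver = self.store.read(key)
        self.seen.setdefault(key, ver)  # remember the version we read
        return value

    def set(self, key, value):
        self.writes[key] = value  # buffered until commit


def run_in_transaction(store, logic, max_retries=5):
    for _ in range(max_retries):
        txn = Txn(store)
        try:
            result = logic(txn)
            store.commit(txn.seen, txn.writes)
            return result
        except Conflict:
            continue  # someone else wrote first: rerun the logic
    raise RuntimeError("transaction kept conflicting")


def buy_sword(txn):
    # Deducting money and delivering the item succeed or fail together.
    money = txn.get("money:1")
    if money < 30:
        return False
    txn.set("money:1", money - 30)
    txn.set("swords:1", txn.get("swords:1") + 1)
    return True


store = Store()
store.commit({}, {"money:1": 100})
attempts = []


def buy_with_interference(txn):
    attempts.append(1)
    ok = buy_sword(txn)
    if len(attempts) == 1:
        # Simulate another worker committing between our read and our commit.
        store.commit({}, {"money:1": 50})
    return ok


bought = run_in_transaction(store, buy_with_interference)
```

Because writes are buffered and applied only at commit, a crash or conflict midway leaves no half-written state: the money is never deducted without the sword being delivered.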
---
title: Architecture Diagram
---
graph TD;
Client1["Client"]-->DNS;
Client2@{ shape: processes, label: "Clients..."}-->DNS;
DNS["Load Balancer DNS"]-->Node_A;
DNS-->Node_B;
subgraph "Node1"
Node_A["Game Server"]-->Worker_A1["Worker"];
Node_A-->Worker_A2@{ shape: processes, label: "Workers..."};
end
subgraph "Nodes..."
Node_B["Game Servers..."]-->Worker_B1["Worker"];
Node_B-->Worker_B2@{ shape: processes, label: "Workers..."};
end
Worker_A1-->Backend["Backend<br/>(Redis Cluster/Replica)"];
Worker_A2-->Backend;
Worker_B1-->Backend;
Worker_B2-->Backend;
Language and Further Performance Considerations
Actually, compared to modern languages, Lua is not that convenient; even many features of C++11 are more comfortable than Lua. Coupled with Lua’s scarcity of libraries and lack of maintenance, phasing out Lua for game servers is an inevitable choice.
Rust is a good option, comprehensively meeting the requirements for safety, performance, and modernity. However, I lean towards dynamic languages, preferably one that is very convenient to write, like Python. In fact, as long as the system is distributed, the bottleneck sits entirely in Redis, and the performance of the language itself doesn’t matter much. CPUs are cheaper than people now. If server maintenance is convenient, using spot instances can further reduce costs by 50%. People even accept the 30% performance reduction from Docker, prioritizing convenience.
Python has many libraries. For example, table reading/writing can be completely wrapped using NumPy arrays. Processing and filtering arrays become very handy. For example, cross-indexing:
array = money_table.query('last_update', left=now - 3600, right=now)
poor = array[array.money < 999]
poor.money += poor.money.mean()
This performs secondary filtering locally, selecting all rows where money is below 999, then processes the data in vectorized form.
This Fortran-style broadcasting/vectorization is very suitable for data processing patterns in games. Vectorized operations run in compiled loops that can use SIMD, which can make them dozens of times faster than an element-by-element loop in an interpreted language. For multi-index queries, MySQL also performs secondary filtering on the CPU, without even using SIMD. This wrapping also solves the previously mentioned inconvenience of Redis data maintenance. Not just maintenance, but also tasks like report generation and income analysis become easier. After all, data processing and presentation are Python’s strengths.
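The snippet above depends on the engine’s own money_table.query call. As a self-contained sketch of the same pattern, the following uses a plain NumPy record array with made-up rows; the field names and values are assumptions for illustration. Note that boolean indexing returns a copy, so this version writes the result back through the mask.

```python
import numpy as np

now = 10_000  # made-up timestamp, in seconds

# Made-up table contents: (owner, money, last_update)
rows = np.array(
    [
        (1, 500.0, now - 100),
        (2, 1500.0, now - 200),
        (3, 200.0, now - 7200),
        (4, 900.0, now - 3000),
    ],
    dtype=[("owner", "i8"), ("money", "f8"), ("last_update", "i8")],
).view(np.recarray)

# Stand-in for the index range query on last_update (the last hour).
recent = (rows.last_update >= now - 3600) & (rows.last_update <= now)

# Secondary filtering done locally: recent rows with money below 999.
poor = recent & (rows.money < 999)

# Vectorized update: give every poor player the mean of the poor group.
rows.money[poor] += rows.money[poor].mean()
```

Owners 1 and 4 match both filters, so each receives their group mean of 700; owner 2 is too rich and owner 3 was not updated within the last hour.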
For server NPC AI, we previously used hand-written behavior trees, state machines, and so on. With Python, the possibilities become far wider. Real AI model approaches can be implemented, such as reinforcement learning (Q-learning, etc.). These methods estimate the value of past actions from the rewards that follow them, searching out the optimal NPC behavior by brute force. Of course, bosses requiring high gameplay quality need more tuning, as reasonable behavior doesn’t necessarily mean fun.
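To make "deducing the value of past behavior through future rewards" concrete, here is a minimal tabular Q-learning update. The states, actions, rewards, and the alpha and gamma values are all made up for illustration; a real NPC would need a proper environment and an exploration policy.

```python
from collections import defaultdict

ACTIONS = ("attack", "retreat")  # hypothetical NPC actions


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward reward + gamma * max Q(s', .)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


q = defaultdict(float)

# One hypothetical episode: approach the player, then land a hit (reward 1).
episode = [
    ("far", "attack", 0.0, "near"),
    ("near", "attack", 1.0, "done"),
]

# Replaying the episode lets the final reward flow back to the earlier step.
for _ in range(2):
    for state, action, reward, next_state in episode:
        q_update(q, state, action, reward, next_state)
```

After the first pass only the rewarded step has value; after the second pass, attacking from "far" also gains value because it leads to the rewarded state.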
Redis Performance Bottleneck and Data Structure Design
Redis performance limits the processing capacity of the game server. Fortunately, large-scale MMOs are less common now, and online populations under ten thousand won’t hit Redis’s transaction limits. Redis’s single-index + transaction capability is roughly 30k reads+writes per second.
First, there should be a client/server separation model. What does this mean? All client subscriptions go through read-only Redis replicas, so they do not affect the master Redis that handles transactions. All server logic code runs against the master. This significantly increases processing capacity.
Then, the design must consider future scalability of the master. The master can scale by forming a cluster of multiple Redis servers. However, there’s a key limitation: transactions cannot span Redis servers. Therefore, the engine must calculate the correlations between tables and place all related tables on the same Redis server. But large tables of the kind used in relational databases end up all correlated with one another and cannot be decoupled. This calls for an ECS (Entity-Component-System) structure.
Simply put, attributes are split into small tables. For example, the money attribute becomes a table called money, referred to as a Component, storing only money and owner attributes. Then, it’s associated (attached) to the relevant player ID via the owner attribute, similar to Unity’s script components. This eliminates large tables, and correlations can be broken down more granularly. Additionally, player (which is just an ID, not physically existing) is the Entity in ECS, and logic code is the System.
---
title: Component Cluster Example
---
graph TD;
subgraph "Cluster 1"
System_A-->Component1;
System_B-->Component1;
end
subgraph "Cluster 2"
System_D-->Component3;
System_D-->Component2;
System_C-->Component2;
end
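The grouping shown in the diagram can be computed with a union-find over the Components each System touches. This is a minimal sketch, not the engine’s implementation: the function name and the input format (a dict from System name to the list of Components it uses) are assumptions for illustration.

```python
def cluster_components(systems):
    """Group Components that are ever used together by the same System."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for components in systems.values():
        first = find(components[0])
        for other in components[1:]:
            # A System's transaction touches both, so they must share a server.
            parent[find(other)] = first

    clusters = {}
    for comp in parent:
        clusters.setdefault(find(comp), set()).add(comp)
    return sorted(sorted(c) for c in clusters.values())


# The dependencies from the diagram above.
systems = {
    "System_A": ["Component1"],
    "System_B": ["Component1"],
    "System_C": ["Component2"],
    "System_D": ["Component3", "Component2"],
}

clusters = cluster_components(systems)

# Hypothetical placement: one Redis server index per cluster.
placement = {comp: i for i, group in enumerate(clusters) for comp in group}
```

Component2 and Component3 end up together because System_D uses both in one transaction, while Component1 can live on a different server.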
Conclusion
With this structure, when writing code, there’s no need for any nasty locks, no need to consider data conflicts. You can write as efficiently and focused as in single-threaded programming, with fewer bugs. The project ultimately only needs to consider the decomposition of Component dependencies to control distribution and performance.
These designs are actually quite lightweight. I estimate the amount of code required is small; the work is mostly mentally taxing. I’m currently developing it and have open-sourced it at https://github.com/Heerozh/hetu; it will be used in our next SLG game. Contributions are welcome.