Building an online MMO Part 1: the First Attempts and Why they Failed

Austin Hicks

2014-03-02 15:56

Those watching me on Twitter are no doubt aware that I've been working on something. Last weekend was quite frankly the most productive I've ever had, and the net result was the very lowest level components of an online game. I have since taken it further, and stand at working terrain as of this posting.

I have decided that I will blog my progress, along with notes and information on the architecture as it develops. This means that I will no doubt have to say what changed, but I will of course post why it didn't work or what I gain from the new model. It is amazing how little information exists on this subject: there are some tutorials on making basic board games and sporatic information on how a real online MMO works, but nothing complete. I have worked a lot of this out myself or brainstorming with fellow developers, and very little of it by reading tutorials. The intent is that this will be an ongoing series, for as long as I maintain interest in writing it. Rather than jump straight in and say what I have currently, I want to start with my first two attempts and why they failed. I think this is important if only to provide some background. It also taught me why I shouldn't do certain things. It also taught me how to use Gevent, and succeeded in convincing me that Gevent is the right path.

Note that Christopher Toth (@mongoose_q on Twitter) is a co-programmer, but I've done 90% or more of the programming thus far. His advice and thoughts have been extremely hellpfull, and I can't take sole credit for my current success (but in all honesty I can take credit for my failures).

This is quite lengthy.

Attempt 1: Gevent, JSON, and actions

When: About 4 days near the end of December, 2013.

Achieved: One or more players could connect to a server and walk around, but speed depended on network.

Twisted is hard at first. I can program in Twisted now, but my skill is limited. I suspect that this would improve with practice, and attempt 2 did teach me that Twisted is not as bad as it looks. It's also the only real tool for solving some problems in Python: it can talk to just about anything without protocol reimplementation.

Nevertheless, for attempt 1, I didn't seriously look at it. I found Gevent. Gevent is great. No deferreds, no delayed actions, just a bunch of little threads-but-not. I still stand by this statement, but am not going to go into detail of why I do now until part 2.

A large body of code was going to be shared between the client and the server. This was a module of actions and a module of objects. The server would tell the client, or the client would tell the server, to execute a specific action. Serialize them with Jsonpickle, bounce them back and forth. If you code them deterministically, all clients have a detailed and accurate state of the world. Clients don't have to wait, either: if they perform an action locally and send it to the server, it's the same as waiting for it to bounce back.

At least, that's what I thought. The first way this model broke down was the last part: you cannot in fact execute all actions on the client first, and you need to start flagging which ones can be and which ones can't. Alone, that's not a huge problem, but the problems kept snowballing.

The second issue is object transport. The client needs to know about all objects that an action wants to touch, and quite a few that will never have anything done to them. The solution is of course to send them somehow. This immediately starts placing limits on order of messages. Jsonpickle to the rescue? No. This isn't enough to kill the model. This complicates it immeasurably. If objects always directly reference other objects, the entire world comes along with them. If an object doesn't send the entire world, it can end up arriving before the objects it references. God help you if there's a cycle.

Even this doesn't kill the model, and the model itself could be made to work-just with a great deal of code. Restrictions such as no cycles, or cycles must have an indirect reference that looks up by id can be made, and they will work. The code was becoming a huge mess, however, and only working because these restrictions were implicit.

Gevent greenlets are addictive. Every client had a greenlet responsible for sending and receiving messages for it, and also for handling its own state. Client greenlets were responsible for sending out messages from the server pub-sub style, by reading from a single queue. I wrote a clever object (which still might have uses) that allowed one to obtain "views" of the queue, so that multiple reading greenlets can be at different points in time.

This immediately introduces one of several major problems. Clients must join at a consistent world state, and at no other time. There are two ways to do this under the model I came up with. First, save state from the last consistent state, have all clients join at that state, and update it when new consistent states are reached. By playing back actions since, reconstruct to the current state. Second, wait for the next consistent state and join then. The first option required an incredibly high degree of determinism. The second required locks.

The knowledgeable reader will ask a question at this point. Why not have the clients request? Requesting state was not an option because the world may not have been in a consistent state-actions were atomic and deterministic but not in the time sense, and two-part actions were in the pipeline. I am going to talk more about this problem and my solution in part 2. I never programmed non-instant actions, but they were planned.

Incredibly high degree of determinism is hard: no floats, no randomness, everything must happen in the same order for all observers.

Locks are even worse: they snowball out from the first object with them, and require all sorts of special case code. Because of the way Gevent works, it was possible for clients to join in the middle of a state update, and only acquire part of it. It was also possible for clients to join in the middle of an outgoing message pass. This meant that the lock was needed anyway.

Finally, the speed of outgoing messages for a client depended on the speed of incoming messages.

By the time I was done dealing with all of this, the implementation was literally incomprehensible. The lesson here is that only one greenlet should ever be permitted to touch game state if at all possible, at least on a per-process basis. The code was incredibly awful; objects were simply posted whenever they were needed, and replaced in the client. The client greenlets contained too much logic concerned with gameplay, and literally inserted new objects. No single problem killed it alone, but the combination of all of these and the lack of programming for the future left me with a sort-of working server that no one could code further.

And to say walking is cheritable. The speed of the player depended on the network. In the process of fixing it, I ended up with something that never worked again.

Lessons:

Locks are bad. There are better methods of concurrency, and if locks become involved something may be wrong with design. There may be a time and a place where locks are appropriate, but this was not it.
Each client needs two greenlets, not one: one for sending and one for receiving. Client greenlets may be concerned with authentication and security, but should never touch game state directly.
Jsonpickle requires care, as does any other form of automated serialization. It may or may not be better to write something custom.

Attempt 2: Twisted Perspective Broker

When: Late December and January.

Achieved: Nothing working.

I did not learn the lessons from the previous model. Some of them are quite subtle. Adding twisted was like throwing fuel on a fire.

I turned to Perspective Broker (hereafter PB) because it solves a number of problems: distribution of state changes through Cacheable, handing actions back and forth as Copyable, and remote method calls in both directions for communication. The action model however had already broken down. The first attempt ended in a bad place: it had so many problems that seeing the major ones required a telescope, and I didn't have a telescope handy.

PB added yet more, however. When an object is on one side or the other, the method by which it is called changes. This is not a bad thing if you program for it, and is in fact useful in a lot of cases. When added to a flawed design, it only makes things worse. A ton of boiler plate code ended up coming into being, and things were becoming "magical" very quickly.

The key problem with bad designs in twisted, at least in my opinion, is that they end up putting deferreds in logic rather than just in networking. If you don't use a deferred and it looks like it is going to be needed, switching over is not a simple task. Since the code was shared between the client and the server, it could conceivably be delayed-if something is out of date, doesn't exist, or the like, it might be deferred.

I have already covered the problems with the underlying model. They lead directly into the above problem: you must do something when an object has not arrived yet. The obvious answer is to defer access to it. The obvious answer is a bad answer, but is not fixable without coming up with a better model.

The lesson and the problem with Twisted can be summarized in two words: Deferreds everywhere. This problem was not Twisted's fault, but rather how I was using it.

The good news is that at no point did I stop understanding my code. It was simply becoming harder and harder to keep moving forward, and the increase was not linear. I can now at least work with basic Twisted code; I never got far enough to learn the libraries beyond PB.

I will close by saying this: the method I am using now does not care about Gevent or Twisted. It would be possible, and probably only the work of a few hours, to rewrite the networking components on top of Twisted. Game logic would probably never even have to import Twisted, and no game logic currently imports Gevent. As this is already quite long, I will save such information for later.