Fixing The Diablo II: Resurrected Servers Sounds Like An Absolute Nightmare

I imagine the last week for the developers has been a bit like this. Image: Kotaku Australia / Blizzard Entertainment

Diablo 2: Resurrected launched, and it’s authentic as all hell, but then the D2 servers took an instant trip to the Seventh Circle. For the last week, players have faced constant login issues and outages. And by the sounds of things, the poor server engineers must be absolutely hating life.

First up: any time a developer posts a blog that surpasses 2,000 words, you know the shit has really hit the fan. It’s a massive explainer on all the issues facing Diablo 2: Resurrected players lately, and it’s so extensive because the problems aren’t caused by a single issue but by a mix of them, ranging from an inability to deal with the game’s popularity, to its architecture, to the fact that players are just way more efficient at smashing Diablo into the dust in 2021.

The first major problem outlined by the team is how players’ characters and data are stored. If you’ve played any Activision or Blizzard multiplayer game over the last few decades, you’ll know that you generally log in to a set of servers as close to your location as humanly possible. It’s not an individual server per se, but a cluster of servers that service an entire region.

Anyway, these servers all have their own regional databases that store the data of the characters that play on them. This is needed because there are too many people playing Diablo 2 to just continually upload everyone’s data to a single, central point.

“Most of your in-game actions are performed against this regional database because it’s faster, and your character is “locked” there to maintain the individual character record integrity. The global database also has a back-up in case the main fails,” Blizzard wrote.

These regional databases periodically send information back to the central database, so that way Blizzard has a singular record (with backups) of your thicc Level 88 Barbarians, Necromancers and so on. Which sounds all well and good — until that central database gets overloaded and the whole system, much like the engineers working on it, needs a nap.

“On Saturday morning Pacific time, we suffered a global outage due to a sudden, significant surge in traffic. This was a new threshold that our servers had not experienced at all, not even at launch,” Blizzard explained.

This was exacerbated by an update we had rolled out the previous day intended to enhance performance around game creation–these two factors combined overloaded our global database, causing it to time out. We decided to roll back that Friday update we’d previously deployed, hoping that would ease the load on the servers leading into Sunday while also giving us the space to investigate deeper into the root cause.

On Sunday, though, it became clear what we’d done on Saturday wasn’t enough–we saw an even higher increase in traffic, causing us to hit another outage. Our game servers were observing the disconnect from the database and immediately attempted to reconnect, repeatedly, which meant the database never had time to catch up on the work we had completed because it was too busy handling a continuous stream of connection attempts by game servers. During this time, we also saw we could make configuration improvements to our database event logging, which is necessary to restore a healthy state in case of database failure, so we completed those, and undertook further root cause analysis.
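
The reconnect storm described above, where every game server retries the moment it loses the database, is the kind of problem commonly tamed with exponential backoff plus random jitter. Blizzard's post doesn't say how its fix works, so what follows is only a generic sketch of that technique; the function name `backoff_delays` and the `base` and `cap` values are assumptions made up for illustration.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one wait time (in seconds) per reconnect attempt.

    Uses "full jitter" exponential backoff: the ceiling doubles on each
    attempt up to `cap`, and the actual wait is a random value below it.
    All numbers here are illustrative, not Blizzard's.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Randomising the wait spreads clients out, so they don't all
        # hammer the database again at the same instant.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(6)):
        print(f"attempt {i}: wait up to {delay:.2f}s before reconnecting")
```

The point of the design is that a recovering database sees a trickle of reconnects spread over time rather than one continuous stream.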

Not exactly the recipe for a fun weekend, that. It also explains why players were having so many issues with progress, too. You’d pick your character, start a game, play for a while, but the regional server couldn’t communicate with the central database after an outage. So it couldn’t tell Diablo 2’s source of “ground truth” about the new gear and XP you’d gained, resulting in frustrated players losing some of the progress they’d made.

The problems only got worse from there. The Diablo 2 servers came back online during a period when most players were online, so even though they rebounded quickly, they crashed again almost immediately once hundreds of thousands of Diablo 2 instances fired up.

And if the weekend was bad, what followed on Monday and Tuesday wasn’t any better:

This leads us into Monday, October 11, when we made the switch between the global databases. This led to another outage, when our backup database was erroneously continuing to run its backup process, meaning that it spent most of its time trying to copy from the other database when it should’ve been servicing requests from servers. During this time, we discovered further issues, and we made further improvements–we found a since-deprecated-but-taxing query we could eliminate entirely from the database, we optimised eligibility checks for players when they join a game, further alleviating the load, and we have further performance improvements in testing as we speak. We also believe we fixed the database-reconnect storms we were seeing, because we didn’t see it occur on Tuesday.

This is the point where I keep hearing my brother’s advice in my head: “Never get into networking.”

Somehow, Diablo 2 hadn’t had enough. The game enjoyed its best ever highs for concurrent players on the Wednesday Australian time — after almost a week of constant login issues and crashes. Blizzard says there were “a few hundreds of thousands of players in one region alone” — which could either be a lot or relatively standard, depending on how Blizzard’s servers define regions. (A few hundred thousand would be hugely impressive for, say, Australia. For a “region” like the United States, not so much, but if that region was a small part of the United States, then maybe it would be. The blog post doesn’t specify here.)

According to the devs, one of the biggest problems causing all of this is how the original Diablo 2 handles core pieces of player behaviour. While Vicarious Visions updated the original D2 code where they could, a large part of the project was keeping what code worked.

Which was fine, up until the point where it stopped scaling.

Diablo 2 has a particular way in which it pulls data from the central database to make sure players can do the things they want to do. Joining a game? That’s calling back to the central database. Want to switch characters? That’s another check to central command to make sure you get the character you asked for, in the spot where you left it, with all the gear you’d worked for.

Diablo 2, according to the team, was designed to be centralised. The downside is that only a single instance of this particular service can be run at any one time, so they can’t offload some of the weight to regional servers.

“Importantly, this service is a singleton, which means we can only run one instance of it in order to ensure all players are seeing the most up-to-date and correct game list at all times,” the devs wrote. “We did optimise this service in many ways to conform to more modern technology, but as we previously mentioned, a lot of our issues stem from game creation.”

For now, there’s a range of short-term solutions and roadmaps to rewrite Diablo 2’s architecture so it can better scale for modern demand. The service that just provides a list of games to players, for instance, is being broken out into a service of its own.

The devs will also be introducing a login queue, à la World of Warcraft, to prevent situations where the servers get overloaded when hundreds of thousands of game instances are launched all at once:

To address this, we have people working on a login queue, much like you may have experienced in World of Warcraft. This will keep the population at the safe level we have at the time, so we can monitor where the system is straining and address it before it brings the game down completely. Each time we fix a strain, we’ll be able to increase the population caps. This login queue has already been partially implemented on the backend (right now, it looks like a failed authentication in the client) and should be fully deployed in the coming days on PC, with console to follow after.
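
For a rough idea of what a login queue with a raisable population cap looks like, here is a minimal sketch. It is not Blizzard's implementation, which the post doesn't detail; the class `LoginQueue` and its method names are hypothetical.

```python
from collections import deque


class LoginQueue:
    """Admit players up to a population cap; everyone else waits in line."""

    def __init__(self, cap):
        self.cap = cap            # hypothetical "safe" population level
        self.online = set()       # players currently admitted
        self.waiting = deque()    # players queued, first come first served

    def request_login(self, player):
        """Return True if admitted straight away, False if queued."""
        if len(self.online) < self.cap:
            self.online.add(player)
            return True
        self.waiting.append(player)
        return False

    def logout(self, player):
        """Free a slot and let the next queued player in."""
        self.online.discard(player)
        self._admit()

    def raise_cap(self, new_cap):
        """Raise the cap (say, after a bottleneck is fixed) and drain the queue."""
        self.cap = max(self.cap, new_cap)
        self._admit()

    def _admit(self):
        while self.waiting and len(self.online) < self.cap:
            self.online.add(self.waiting.popleft())


if __name__ == "__main__":
    queue = LoginQueue(cap=2)
    for name in ("amazon", "barbarian", "necromancer"):
        print(name, "admitted" if queue.request_login(name) else "queued")
    queue.raise_cap(3)
    print(sorted(queue.online))
```

Raising the cap in steps mirrors the approach in the quote: fix a strain, lift the limit, watch where the system strains next.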

Players will also be rate limited, but only in instances where games are being created, closed and recreated in short spaces of time, which is mostly instances where players are farming areas like Shenk & Eldritch or Pindleskin. “When this occurs, the error message will say there is an issue communicating with game servers: this is not an indicator that game servers are down in this particular instance, it just means you have been rate limited to reduce load temporarily on the database, in the interest of keeping the game running,” Blizzard advised.
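
The rate limiting on game creation can be pictured as a sliding-window counter per player. Blizzard hasn't published its thresholds or method, so the sketch below is purely illustrative; the class `GameCreationLimiter` and the "3 games per 60 seconds" figures in the usage example are made-up assumptions.

```python
from collections import deque


class GameCreationLimiter:
    """Allow at most `max_games` game creations per player per time window."""

    def __init__(self, max_games, window_seconds):
        self.max_games = max_games      # illustrative limit, not Blizzard's
        self.window = window_seconds
        self.history = {}               # player -> deque of creation times

    def allow(self, player, now):
        """Return True if the player may create a game at time `now` (seconds)."""
        stamps = self.history.setdefault(player, deque())
        # Forget creations that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.max_games:
            return False                # rate limited: ask the client to wait
        stamps.append(now)
        return True


if __name__ == "__main__":
    limiter = GameCreationLimiter(max_games=3, window_seconds=60)
    for t in (0, 10, 20, 30, 60):
        verdict = "ok" if limiter.allow("pindle_farmer", t) else "rate limited"
        print(f"t={t}s: {verdict}")
```

A player who makes a few games an hour never notices a limit like this; only rapid create, close and recreate loops run into it.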

It all sounds like an absolute nightmare, to be honest, and I feel for the engineers who have what looks like months of retroactive fixes in front of them. There’s a school of internet thought that says, well, Blizzard should have seen this coming and planned for it. But that’s also fundamentally part of the risk you take with remasters. These games were written back in an age where information and multiplayer services didn’t have the popularity or ease of access that we have today, and it’s difficult to know whether a lot of that old infrastructure scales the way we think it might. Sometimes it does — right up until the point where it all collapses in a flaming heap.

Comments

  • You should address the fact that they created this problem by forcing an online connection for single player games and not allowing local servers for multiplayer.

    • “But that’s also fundamentally part of the risk you take with remasters.”
      Like Darath says… I don’t think this is part of the risk with remasters so much as it is part of the risk with imposing an always-online connection for solo players. If they’d stuck with the original title’s solutions to the problem of cheat-using offline characters infecting online play, I can only imagine the server issues would’ve been significantly less severe.

      This is a new problem that they introduced not out of consideration for decades-old constraints, but out of consideration for entirely new, greedy, modern corporate bullshit constraints.

      • The risk with remasters is in the legacy code, especially if there’s a multiplayer component. The more legacy code you keep, the more you run the risk of things being exposed by a scale that simply didn’t exist when those games were originally created.

        Devs might think it’s capable of standing up, but you don’t really know until it’s fully out in the wild. And even imposing the always-online connection might not have solved the problem here — the game still has to send back characters to a central database so it has that ground truth check. Maybe that’s eliminated if there is a separate local-only database, so you can’t have offline characters transfer over to online, but that wouldn’t have fixed all of the crashes in this instance anyway, so people would still run into problems.

        The trick here is people thinking offline-only characters would have resolved these problems, but the vast majority of people would have just played with online enabled anyway (especially on console). That volume and scale is what fundamentally broke down — if people are thinking that hundreds of thousands of D2 players coming into a game at launch would have all played offline-only, that’s exceedingly optimistic.

        • Counter-point: People wouldn’t be as mad when the servers are down if they can just hop on to their offline-only alt and still play the game they bought.

          • (Damn, can you imagine if that extended beyond old titles? “I feel like playing some Destiny!” ‘Well you can’t. Servers are down. You have to do something else.” “Awh. But… but I don’t wanna do something else. I feel like playing Destiny.” “WELL NOW YOU CAN! Introducing… the Offline Experience!”)

        • Yep. They would have been all online. Until all the crashes. Then they would’ve switched to offline.

          I think you’re being far too generous to blizzard. Not many companies have the experience and resources for online the way they do. Remember Diablo 3? Then WarCraft 3. Now Diablo 2.

          Once would be bad luck.

        • I never played Diablo 2 online because it was filled with cheaters. So no, I don’t think everyone would’ve been playing online for the remaster, if only due to PTSD from the amount of people flying around the map one shotting people. Some people just want to play the game and it doesn’t need to be online for that to happen. Blizzard screwed it for themselves again by requiring online for single player.

  • The discussion around refunds is interesting in light of the publisher-imposed always-online DRM requirement.

    Clearly the goal of publishers everywhere is to erode the concept of First Sale Doctrine to combat second-hand sales, piracy, and even the ability for anyone to enjoy an older experience that isn’t currently being driven by marketing and social pressure for maximum ROI.

    Every battle fought over First Sale Doctrine and the rights of consumers, every EULA we agree to during installs sees mention of the fact that these are not products bought so much as ‘limited licences’ for highly conditional, revokable access to the product.

    I would dearly love to see this issue examined by various nations’ consumer protection agencies to mandate and enforce some minimum standards on the obligations of publishers towards licensees, with regard to reasonable expectations of availability, quality, and reliability.

    A comment in a previous article scoffed at the idea that someone might obtain 20 hours worth of entertainment from the product then feel entitled to a refund when access becomes patchy and unreliable and/or progress is lost, but it’s very obvious with even a passing familiarity with the franchise how unreasonable a stance that is. 20hrs doesn’t even begin to scratch the surface of what the product actually promises and would reasonably be expected to deliver. There needs to be a real regulatory examination of what a consumer should expect to get when they purchase a ‘licence,’ and it needs to be examined in the context of replacing and abandoning the benefits of First Sale Doctrine.

    Publishers are pushing the ‘limited licence’ excuse to have it all their way and nothing else, and there needs to be push-back against that.
