GMAIL Down Today, Routing and Network Design Problems?

By Wayne Lawson II on September 2nd, 2009

Many of our blog followers and students experienced email problems today because of this outage. Do any of our “CCIE Candidates” want to give a “technical guess” at what caused the routing issue (briefly explained below), and how it could have been avoided? I’m not convinced that the “request routers” discussed below are actually “Cisco Routers”, but they still do something very similar: route traffic! Anyone have any “design practice suggestions” that we should send over to Google?! ;-)

Taken from businessinsider.com:

Google has explained why its Gmail Web mail service went down today. In short, they took a few Gmail servers down for routine maintenance. But the remaining servers couldn’t handle the rest of the load — ironically, because of recent changes designed to make the systems more reliable — and everything stacked up until it crashed.

Not great that Gmail was down, but good that Google is so open about the problem and parts of its solution. We doubt it lost much serious business today. (Especially because the email servers were up the whole time; just Gmail’s Web interface wasn’t. If actual mail were lost, that’d be a different story.)

Here’s Google’s explanation from Ben Treynor, an engineering VP:

Gmail’s web interface had a widespread outage earlier today, lasting about 100 minutes. We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there’s a problem with the service. Thus, right up front, I’d like to apologize to all of you — today’s outage was a Big Deal, and we’re treating it as such. We’ve already thoroughly investigated what happened, and we’re currently compiling a list of things we intend to fix or improve as a result of the investigation.

Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.

However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
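For readers who want to see how quickly this kind of cascade runs away, here is a minimal sketch of the failure mode described above. It is a toy model, not Google's system: the function name `simulate_cascade`, the capacity figures, and the even split of load across active routers are all assumptions made for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Toy model: load is split evenly across active routers, and any router
    whose share exceeds its capacity withdraws ("stop sending us traffic").

    Returns (history, survivors), where history is a list of
    (active_router_count, load_per_router) tuples, one per round.
    """
    active = list(capacities)
    history = []
    while active:
        share = total_load / len(active)
        history.append((len(active), share))
        # Routers that cannot carry their share drop out of rotation.
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # stable: every remaining router can carry its share
        active = survivors
    return history, len(active)


# Hypothetical fleet: eight routers that can take 100 units each and two
# that can only take 90 (for example, because of extra per-request work).
fleet = [100] * 8 + [90] * 2

print(simulate_cascade(fleet, 850))  # below every router's limit: stable
print(simulate_cascade(fleet, 950))  # two routers tip over, then the rest
```

In the second call, only two routers are overloaded at first, but once they withdraw, the remaining eight are each asked to carry more than they can handle, and the whole tier goes down in the next round.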

The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google’s architecture), distributed the traffic across the request routers, and the Gmail web interface came back online.
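The capacity side of this fix comes down to simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the function `routers_needed`, its parameters, and the 50% spare margin are assumptions of ours, not figures from Google.

```python
import math


def routers_needed(peak_load, per_router_capacity, headroom=0.5,
                   offline_for_maintenance=0):
    """Number of routers needed to carry peak load with a spare margin.

    headroom=0.5 means "provision for 150% of peak"; routers taken offline
    for maintenance are added on top, since they carry no traffic.
    """
    required = peak_load * (1 + headroom) / per_router_capacity
    return math.ceil(required) + offline_for_maintenance


# Hypothetical numbers: 950 units of peak load, 100 units per router.
print(routers_needed(950, 100, headroom=0.0))
print(routers_needed(950, 100, headroom=0.5, offline_for_maintenance=2))
```

The point of the `offline_for_maintenance` term is that a fleet sized exactly for peak has no room left once a few machines are pulled for upgrades.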

What’s next: We’ve turned our full attention to helping ensure this kind of event doesn’t happen again. Some of the actions are straightforward and are already done — for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle — for example, we have concluded that request routers don’t have sufficient failure isolation (i.e. if there’s a problem in one datacenter, it shouldn’t affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We’ll be hard at work over the next few weeks implementing these and other Gmail reliability improvements — Gmail remains more than 99.9% available to all users, and we’re committed to keeping events like today’s notable for their rarity.
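To make the “degrade gracefully” point concrete, here is a rough comparison of the two overload policies mentioned above. Again, this is a toy model under our own assumptions (even load split, made-up capacities, hypothetical function names), not a description of how Google's request routers work.

```python
def served_when_refusing(capacities, total_load):
    """Overloaded routers refuse all traffic and shift their load to the rest.

    Returns the load that ends up being served once the system is stable.
    """
    active = list(capacities)
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            return float(total_load)  # remaining routers carry everything
        active = survivors
    return 0.0  # every router has withdrawn


def served_when_degrading(capacities, total_load):
    """Every router stays in rotation and handles what it can promptly.

    The excess on each router is assumed to be delayed (slower responses)
    instead of being pushed onto its neighbours.
    """
    share = total_load / len(capacities)
    return sum(min(c, share) for c in capacities)


fleet = [100] * 8 + [90] * 2  # hypothetical per-router capacities
load = 950

print(served_when_refusing(fleet, load))   # the cascade takes everything down
print(served_when_degrading(fleet, load))  # most traffic is still served promptly
```

With these made-up numbers the “refuse and shift” policy serves nothing, while the “just get slower” policy still handles 940 of 950 units promptly and only delays the small remainder.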


4 Responses to “GMAIL Down Today, Routing and Network Design Problems?”

  1. ananth says:

    It is a QoS issue.

    Google needs to understand that even when it offers such a service for free, there should be proper traffic analysis.
    Gmail is out of beta and is also sold to organisations, so service delivery is affected.
    But Google will surely rectify this.

  2. Steve says:

    I know Google is a heavy Juniper shop, but I read “request router” as a load balancer. I am not sure what they use for that.

  3. As Google itself stated, this was caused by a lack of resources (routers) and also a minor design mistake …
    I believe that it’s not correct for a router to just tell the others to stop sending traffic. Anyway, they already have the solution: as stated, they are increasing router capacity and making some minor changes so that the traffic is handled correctly.
    But I bet they did not do the load testing properly, because this could have been tackled during the testing phase.

    Regards,

    Iwan Hoogendoorn
    CCIE3 #13084 (R&S / Security / SP)
    Sr. Support Engineer – IPexpert, Inc.
    URL: http://www.IPexpert.com

  4. Pat says:

    Hmm.. redundancy isn’t robust enough to handle an outage.. where have I heard this one before… This doesn’t sound like a router issue. Usually your bottlenecks in multi-site web server environments are either the load balancers or 3DNS (wide-area load balancers that load balance DNS queries to the internet). I have to agree with Iwan here: routers don’t tell others to stop sending traffic. Sounds like he was trying to use less technical terms to dumb it down.
