The biggest news story in the Internet startup world last week was the unexpected downtime of Amazon Web Services (AWS), which brought down dozens (if not hundreds) of websites and web applications, including Reddit, Quora, and Foursquare. This wasn't the first time AWS experienced an outage, though it has been more than 3 years since the last one. No cloud hosting provider is immune to massive outages, and with most Internet software startups opting for the cloud instead of building in-house infrastructure, it is worth looking at some of the architectural strategies that account for such unfortunate events.
Twilio, a San Francisco startup that connects web applications with phone lines, shares on its freshly created engineering blog some of the design principles that helped it remain available during the outage. The main reason is that Twilio uses AWS very selectively, only when the service offered matches its strict requirements:
- Unit-of-failure is a single host
- Short timeouts and quick retries
- Idempotent service interfaces
- Small stateless services
- Relax consistency requirements
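
To make a few of these principles more concrete, here is a minimal Python sketch of quick retries across hosts combined with an idempotent interface. It is an illustration only, not Twilio's code: the names `make_host` and `call_with_retries`, the `(request_id, amount)` request shape, and the shared `ledger` dictionary (standing in for an external datastore, so the handlers themselves stay stateless) are all assumptions. Timeouts are simulated by raising `TimeoutError` instead of waiting on a real network call.

```python
import itertools


def make_host(ledger, lose_response=False):
    """Build a hypothetical request handler for one host.

    The handler keeps no state of its own; everything durable lives in
    `ledger`, which stands in for an external datastore.
    """
    def handle(request):
        request_id, amount = request
        # Idempotent: repeating the same request_id never adds a second entry.
        ledger.setdefault(request_id, amount)
        if lose_response:
            # Simulate a short timeout: the work happened, the reply was lost.
            raise TimeoutError("host did not answer in time")
        return ledger[request_id]
    return handle


def call_with_retries(hosts, request, max_attempts=3):
    """Try hosts in turn, moving on as soon as one times out."""
    last_error = None
    for host in itertools.islice(itertools.cycle(hosts), max_attempts):
        try:
            return host(request)
        except TimeoutError as err:
            last_error = err  # a single host is the unit of failure; skip it
    raise RuntimeError("all attempts failed") from last_error


if __name__ == "__main__":
    ledger = {}
    hosts = [make_host(ledger, lose_response=True), make_host(ledger)]
    print(call_with_retries(hosts, ("req-1", 25)))
    print(ledger)
```

Because the request carries its own identifier, the retry on the second host is safe even though the first host already applied the change before its response was lost. Without idempotency, a quick retry policy like this one could apply the same operation twice.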
The blog post is heavy on engineering terms and concepts, so see the comments for clarifications if needed.