In 2009, Edrans implemented a monitoring system for Zappos, one of the most important online shoe and clothing stores in the US. In its first year, the system allowed the company to quickly optimize its IT infrastructure, cut the error rate by 95% and improve overall website latency by 50%, improving business metrics to the point where Zappos reached 1 billion dollars in sales that same year. Amazon acquired the company shortly afterward, and Zappos's business kept growing exponentially. Together with them we have continued monitoring the platform, which currently requires visibility over 10 different business units, with more than 15,000 variables analyzed every day.
Zappos already had a monitoring tool in place, but it performed poorly and could not keep up with the growth of the platform. As a result, the IT team spent too much time keeping the tool up to date and running instead of analyzing and tracking business variables and IT processes.
The challenge was to develop a monitoring system to automate the visualization of numerous business metrics to manage IT in a more efficient way.
First steps: Inside Zappos
During the first phase of the project, our priority was to identify the most beneficial metrics to watch, not only so the IT department could keep its operations running efficiently, but also so we could understand how to help the business at every level. The teams at Zappos and Edrans worked side by side, analyzing each system used by Zappos. We had to understand the technology and, above all, the business, from the call center to the marketing strategies and distribution centers. That way, we could help achieve one of the 10 core values at Zappos: to provide a “WOW” experience to their clients.
The first big goal: The holiday season
Meanwhile, the holiday season, the biggest sales season in the US, was rapidly approaching. We needed to stabilize the existing monitoring platform and implement the basic metrics that were essential for Zappos to have their platform ready.
Step 1: Graph the metrics: number of errors, sales per minute, latency times and many more, for each layer of the infrastructure, from the web servers all the way down to the databases. We wanted to know how Zappos was performing at each step of the sales process, from searching for a product to payment. During this phase, our client gained high visibility, which later allowed them to identify and adjust the areas that needed improvement. The results were convincing: in just a few months, they were able to reduce the error rate by 95% and improve overall latency by 50%. That year, Zappos reached 1 billion dollars in sales and Amazon acquired it.
New Phase: The Expansion
With the high season over, a new phase was starting: providing the highest possible visibility into each system, giving the teams more control over the environment and increasing productivity. We started by collecting performance data so we could analyze trends. In parallel, we had to reduce the number of incidents and resolve them quickly, with an efficient monitoring solution able to detect the root cause of a problem fast. We already had thousands of new metrics for analyzing trends, planning capacity, configuring alerts and resolving problems. At this point, we realized that the monitoring tool in use at the time could not support the extra workload added by all the new metrics.
What to do then? With the rapid growth of the monitoring implementation and the visibility into every system, we had to migrate to a more scalable, robust and easy-to-use solution. We decided to use the worldwide standard for IT monitoring, Nagios (version 3), which opened up a wide range of new possibilities. From that moment on, we were able to monitor everything we wanted, and the changes at Zappos became evident very quickly.
The company now had a clear picture of every aspect of its business, from sales and e-mail marketing campaigns to user experience, security and even fraud control. With all these tools, Zappos gained recognition in the market, providing added value through predictive technology that made it possible to identify problems before they affected clients. The company's IT team was now operating at full capacity and was ready to analyze and handle big sales events. This was the scenario for Cyber Monday 2011, the day on which online sales websites see five times their normal levels of traffic and orders. Zappos passed the test with ease, once again offering a “WOW” experience to their clients.
The big step: Integration with Amazon
Amazon watched Zappos's growth and changes very closely. After it acquired the company, the challenge was to integrate each application and the distribution center with Amazon's systems. Edrans had just two months to implement the monitoring instrumentation, adding new metrics and statistics provided by Amazon. Once again, we were starting from scratch, because we needed to learn, understand and implement a new system built on a different technology from the one we had worked with before. Thanks to the flexibility of the monitoring solution, we were able to respond to the new requirements and create metrics that neither Zappos nor Amazon had seen before. The result was once again a success: the Edrans team developed a custom solution that allowed Zappos to detect problems even before Amazon did. Today, given the scale of the business, the challenges posed by Zappos and Amazon are significant. They are what motivates us to keep innovating at Edrans, with a team of professionals working side by side with our client.
- 24x7 monitoring service for every Zappos business unit.
- More than 15,000 metrics controlled daily.
- More than 5,000 graphs collecting all kinds of data, helping reduce the number of incidents by speeding up investigations and trend analysis.
- Data from each of those metrics collected and analyzed once a minute, across different areas of the infrastructure, with alerts triggered when needed.
- More than 400 runbooks written so far, allowing the Edrans team to provide first-class support and helping Zappos understand each metric, which reduces the resolution time when an alert is triggered.
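To make the once-a-minute collect, analyze and alert cycle more concrete, here is a minimal sketch of how a single check could be evaluated. It is not the Edrans or Zappos implementation: the metric name `error_rate`, the thresholds and the helper functions are hypothetical. The only parts taken from Nagios itself are the plugin conventions for status codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and for performance data after the `|` character.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Nagios plugin convention: the status code tells the scheduler the check result.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


@dataclass
class Threshold:
    """Hypothetical thresholds, in percent, for a metric where higher is worse."""
    warning: float
    critical: float


def error_rate(errors: int, requests: int) -> Optional[float]:
    """Error rate in percent for one collection interval, or None if there is no traffic."""
    if requests <= 0:
        return None
    return 100.0 * errors / requests


def evaluate(value: Optional[float], threshold: Threshold) -> int:
    """Map a metric value to a Nagios status code."""
    if value is None:
        return UNKNOWN
    if value >= threshold.critical:
        return CRITICAL
    if value >= threshold.warning:
        return WARNING
    return OK


def run_check(errors: int, requests: int, threshold: Threshold) -> Tuple[int, str]:
    """Return (status code, one-line plugin output with performance data)."""
    value = error_rate(errors, requests)
    status = evaluate(value, threshold)
    if value is None:
        return status, f"ERROR_RATE {STATUS_NAMES[status]} - no requests in interval"
    line = (
        f"ERROR_RATE {STATUS_NAMES[status]} - {value:.2f}% of requests failed"
        f" | error_rate={value:.2f}%;{threshold.warning};{threshold.critical}"
    )
    return status, line


# Example with assumed numbers: 20 errors out of 1,000 requests in the last minute.
status, line = run_check(20, 1000, Threshold(warning=1.0, critical=5.0))
print(status, line)
```

A real plugin would print the line and exit with the status code, so that the scheduler can graph the performance data and send an alert when the status changes. The runbook for the metric then tells whoever is on call how to interpret the alert.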
Solution in numbers
- 95% decrease in error rate
- More than 15,000 daily metrics monitored