While I knew the importance in principle, what I didn’t know was that I would spend the next six months making collaborative enterprise solutions a reality for the company behind the social and collaborative phenomenon. Earlier this year I had the opportunity to take over a team at Facebook while a Senior Manager went on leave, and I was tasked with turning around a struggling ITSM implementation. This was fairly straightforward, and it wasn’t long before I had the ship righted and was seeking out new ways to innovate and extend their SaaS platforms.
It was at this point that Facebook came to us because they needed an evolution of their existing systems to handle the day-to-day operations of their new flagship, state-of-the-art datacenters.
Given Facebook’s meteoric rise and growth (600 million 30-day actives, #1 most trafficked internet site), they needed a system to manage operations, including server repairs as well as inventory, asset, and part management, on a scale that most have never seen before. The new data centers will double the number of repair actions and servers being processed and tracked throughout their lifecycle, so the new system would have to scale accordingly while delivering a requested 100% improvement over the existing system’s performance.
The project was nicknamed 4-Clicks because the goal of the Director of Datacenter Operations was that each server repair could be identified, assigned, and completed in just 4 clicks of the mouse. That wasn’t the only imperative; we had four critical success criteria:
- Lightning Fast – Facebook built the largest database of people on the planet and they did it with a responsiveness seldom seen in enterprise software. They measure page load times in milliseconds and expected the same from this solution.
- Intuitive – Facebook doesn’t come with a user manual and neither should this tool. The expectation is that any user should be able to walk up to the application and begin solutioning repairs with little training or instruction.
- 4-Clicks – Only 4 clicks to solution a server repair start to finish.
- Parts Master – Selecting tickets for review automatically stages the parts that need to be pulled from inventory to complete a repair.
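The Parts Master idea can be sketched as a simple aggregation: given the tickets a technician has selected, count how many of each part must be pulled. This is only an illustration; the `RepairTicket` shape, the `stageParts` function, and the part numbers are hypothetical and are not taken from the actual tool.

```typescript
// Hypothetical ticket shape; the real record layout is not described in this post.
interface RepairTicket {
  id: string;
  serverId: string;
  // Part numbers that the diagnosis says must be replaced.
  partsNeeded: string[];
}

// Build a pick list (part number -> quantity) for all selected tickets.
function stageParts(selected: RepairTicket[]): Record<string, number> {
  const pickList: Record<string, number> = {};
  for (const ticket of selected) {
    for (const part of ticket.partsNeeded) {
      pickList[part] = (pickList[part] || 0) + 1;
    }
  }
  return pickList;
}

// Example: two selected tickets that share one part type.
const stagedParts = stageParts([
  { id: "T1", serverId: "srv-01", partsNeeded: ["DIMM-8GB", "HDD-1TB"] },
  { id: "T2", serverId: "srv-02", partsNeeded: ["DIMM-8GB"] },
]);
```

In this sketch the technician would see a single consolidated list of parts to pull before starting the selected repairs.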
I really wanted to build Facebook a killer app based on their existing design principles and patterns. When people go to Facebook they can, at a glance, glean all relevant information and updates from their social network. I wanted to achieve the same for technicians in the datacenter as they go about their day. All relevant information about the data center, server status, health, emergency repairs, and preventative maintenance is pushed to them like in a newsfeed.
Having the freedom to utilize (read: borrow) the existing Facebook design patterns helped us with our secondary goal of delivering a compelling, intuitive user experience. If you are familiar with Facebook, you should be able to walk up and start maintaining and repairing servers using our tool.
Each server has its own wall so a tech can easily see current status, repair history, and the pre-determined diagnosis for why the machine is not functioning as expected. We even utilized the Facebook Graph API to dynamically pull a user’s current profile picture when building server repair history.
Additionally, the tool integrates with Facebook’s homegrown server discovery/management tools to deliver repair tickets that come with an existing diagnosis identifying the exact problem parts that need to be replaced and their specific location/serial numbers for verification. When a server is repaired, our tool will transition any new parts used in the repair out of the consigned inventory system as well as mark old parts as replaced for RMA. Integrations exist to communicate the parts failure to the supplier for immediate stock replenishment.
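As a rough sketch of the profile picture lookup, the Graph API exposes a `picture` edge on a user node, so a wall entry can point an image at that URL. The `RepairEvent` shape and both function names below are hypothetical, and depending on the Graph API version an access token may also be required.

```typescript
// Hypothetical shape of one entry in a server's repair history.
interface RepairEvent {
  technicianId: string; // Facebook user ID of the technician
  summary: string;
}

type PictureType = "small" | "normal" | "large" | "square";

// Build the Graph API URL for a user's current profile picture.
function profilePictureUrl(userId: string, type: PictureType = "square"): string {
  return `https://graph.facebook.com/${encodeURIComponent(userId)}/picture?type=${type}`;
}

// Pair each repair event with the avatar URL to show on the server's wall.
function toWallEntry(event: RepairEvent): { avatarUrl: string; text: string } {
  return { avatarUrl: profilePictureUrl(event.technicianId), text: event.summary };
}

const wallEntry = toWallEntry({ technicianId: "4", summary: "Replaced failed DIMM" });
```

Because the URL always resolves to the current picture, the repair history does not need to store or refresh images itself.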
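The part transitions described above can be illustrated with a small state update. This is a sketch under assumed names: the `Part` shape, the status values, and `completeRepair` are hypothetical and do not reflect the actual Service-Now tables or the supplier integration.

```typescript
// Hypothetical part statuses for illustration only.
type PartStatus = "consigned" | "installed" | "pending_rma";

interface Part {
  serial: string;
  status: PartStatus;
}

// When a repair completes, move new parts out of consigned inventory
// and flag the replaced parts for RMA. Other parts are left unchanged.
function completeRepair(inventory: Part[], newSerials: string[], oldSerials: string[]): Part[] {
  return inventory.map((part): Part => {
    if (newSerials.indexOf(part.serial) !== -1) {
      return { serial: part.serial, status: "installed" };
    }
    if (oldSerials.indexOf(part.serial) !== -1) {
      return { serial: part.serial, status: "pending_rma" };
    }
    return part;
  });
}

const updatedInventory = completeRepair(
  [
    { serial: "NEW-001", status: "consigned" },
    { serial: "OLD-001", status: "installed" },
    { serial: "NEW-002", status: "consigned" },
  ],
  ["NEW-001"],
  ["OLD-001"],
);
```

In a fuller version, the parts flagged for RMA would also drive the replenishment message sent to the supplier.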
- Built on top of the Service-Now SaaS Platform, integrating Incident, Inventory, Asset Management, and the CMDB as one solution to track datacenter operations, repairs, and parts replacements/replenishments.
- Utilizes the Facebook Graph API to build a social enterprise application based on familiar Facebook design patterns.
- Delivered a scalable solution to automate and streamline repairs and track the status and health of hundreds of thousands of servers and millions of server component parts in the CMDB.
- Integrates with existing Facebook server management tools to deliver a repair ticket that comes with an existing diagnosis identifying the exact problem part that needs to be replaced and its location/serial number for verification.
- Changing the service management conversation to include real-time social collaboration.
- An entirely Ajax solution, meaning that, like Facebook, we have designed a no-navigation user interface. Every click loads the appropriate form and data without navigating away from the current page context.
- Implements HTML5 multi-threading and caching, meaning multiple processes handle requests concurrently for lightning-fast speed and efficiency. Data is periodically cached, minimizing the load on the system.
- Built on top of the world-class Service-Now SaaS Platform. Extended existing Incident, Asset, Inventory, and CMDB modules to build out a total solution to document and manage the Facebook Server Infrastructure. Utilizing the Service-Now Platform gives us the ability to tap into existing Service Management methodology and functionality. Adding SLAs, custom notifications, or mapping service dependencies to hardware clusters is available out of the box.
- Breaking Ground on Our First Custom Data Center http://blog.facebook.com/blog.php?post=262655797130
- Too Big To Fail: Facebook Begins Building Its Own Data Center In Oregon http://techcrunch.com/2010/01/21/facebook-data-center/
- Facebook To Build Its Second Data Center To The Tune Of $450 Million http://techcrunch.com/2010/11/11/facebook-to-build-its-second-data-center-to-the-tune-of-450-million/