Standards Based Hardware Diagnostics
Posted by Charles Wimmer in Scale, System Administration on July 10, 2010
I recently changed jobs. I started working at Yahoo! as a Service Engineer. Part of my job is to administer the servers that we use to run Apache Hadoop. We currently run 38,000 servers. That number is growing by thousands each quarter. It was 25,000 when I interviewed earlier this year.
When operating servers at this scale, your perspective changes. When considering high availability, you have to consider the system as a whole. Your software, rather than your hardware, needs to ensure the availability of your system. I don’t care what brand or configuration of hardware you purchase; you cannot operate 38,000 servers without hardware failures. Hardware breaks. Fans break. Hard drives fail. Power supplies stop working. Cosmic rays flip bits in memory. (Don’t laugh. People are designing systems to detect and work around cosmic rays.) This is a fact of life.
Another impact of this scale is the cost of redundant components. Consider the cost of acquisition of 20,000 servers. If you can save $100 per server by eliminating a redundant power supply and redundant fans, you just saved $2,000,000.
The end result of these forces is that System Administrators must become efficient at managing hardware failure. It is no longer a question of if servers will fail. The question becomes how many will fail and how fast you can get them back in service.
Managing hardware failure is one of the most time-consuming (and therefore one of the most expensive) aspects of my job. Here is a comparison to put it in perspective. If I needed to reinstall the operating system on 40 servers, it would take me about 30 minutes. I would write a script to edit some PXE config files and then use IPMI to power cycle the systems. I could then turn to another task for the remainder of the installation. This solution scales as well. If I needed to reinstall 4,000 servers, it would still take me about 30 minutes.
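To make the "edit some PXE config files, then power cycle over IPMI" step concrete, here is a minimal sketch in Python. It is not the script I actually use. The host names, file paths, kernel and kickstart locations are all made up for illustration, and it assumes a pxelinux-style TFTP layout (one config file per host, named `01-` plus the MAC address) and the `ipmitool` command line client.

```python
from dataclasses import dataclass
from pathlib import Path
import subprocess

# Hypothetical installer locations; substitute whatever your site serves.
KERNEL = "installer/vmlinuz"
INITRD = "installer/initrd.img"
KICKSTART_URL = "http://install.example.com/ks/hadoop-node.cfg"


@dataclass(frozen=True)
class Host:
    name: str
    mac: str  # MAC address of the NIC that PXE boots
    bmc: str  # address of the host's IPMI controller


def pxe_config_name(mac: str) -> str:
    """pxelinux looks for a per-host file named 01-<mac>, lowercase with dashes."""
    return "01-" + mac.lower().replace(":", "-")


def pxe_config_body(kernel: str, initrd: str, kickstart_url: str) -> str:
    """A bare-bones pxelinux entry that boots straight into an unattended install."""
    return (
        "default reinstall\n"
        "label reinstall\n"
        f"  kernel {kernel}\n"
        f"  append initrd={initrd} ks={kickstart_url}\n"
    )


def ipmi_commands(bmc: str, user: str, password_file: str) -> list[list[str]]:
    """Ask the BMC to PXE boot on the next start, then power cycle the chassis."""
    base = ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", user, "-f", password_file]
    return [
        base + ["chassis", "bootdev", "pxe"],
        base + ["chassis", "power", "cycle"],
    ]


def plan_reinstall(hosts, tftp_dir, user, password_file):
    """Return the files to write and commands to run, without touching anything."""
    files: dict[Path, str] = {}
    commands: list[list[str]] = []
    for host in hosts:
        path = Path(tftp_dir) / "pxelinux.cfg" / pxe_config_name(host.mac)
        files[path] = pxe_config_body(KERNEL, INITRD, KICKSTART_URL)
        commands.extend(ipmi_commands(host.bmc, user, password_file))
    return files, commands


def apply_plan(files, commands):
    """Write the PXE configs, then run the IPMI commands one by one."""
    for path, body in files.items():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body)
    for command in commands:
        subprocess.run(command, check=True)
```

The plan and the apply step are separate so the plan can be printed and reviewed before anything is power cycled. The loop is the same whether the host list has 40 entries or 4,000, which is the point of the comparison.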
Consider, on the other hand, a failure in the disk that contains the root partition. In order to diagnose and resolve the problem, I would have to log on to the console via IPMI, examine logs, and possibly reboot once or twice to figure out the problem. Then, I would have to fill out a ticket for the site operations team to replace the disk. Overall, this might take 10 minutes or more for a single server. You can see that managing tens of thousands of servers which have dozens of failures per day would get very cumbersome very fast.
I can hear you now. “My enterprise-grade server hardware includes hardware management that eliminates the diagnostics problem.” This is true. For the scale of many enterprises, the combination of enterprise server hardware and the manufacturer’s management software minimizes the expense of diagnosis. There are two problems that make this unusable at a larger scale. First, the cost of enterprise-grade hardware is significantly higher than commodity hardware. Sometimes it is as much as double. Second, I have not seen management software that scales to tens of thousands of compute nodes.
Indulge me for a moment while I go off on a tangent. IPMI for large scale system administration is the best thing since sliced bread. I can remotely connect to a server, get on the console, power cycle it, see a log of some hardware events, and read the temperature and fan speed. It works cross platform. I don’t need to worry about which server manufacturer I buy from as long as they follow the IPMI standard.
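As an illustration of how scriptable this is, the sketch below parses the three-column text that `ipmitool sdr list` prints (sensor name, reading, status) and picks out sensors that report a reading but are not `ok`. The sample text is made up; sensor names and the exact set of status codes vary by vendor, so treat the details as assumptions rather than a description of any particular board.

```python
from typing import NamedTuple, Optional


class Sensor(NamedTuple):
    name: str
    value: Optional[float]  # None when the sensor has no numeric reading
    unit: str
    status: str


def parse_sdr(text: str) -> list[Sensor]:
    """Parse 'name | reading | status' lines into Sensor records."""
    sensors = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, reading, status = fields
        number, _, unit = reading.partition(" ")
        try:
            value = float(number)
        except ValueError:
            # Readings such as "no reading" or "0x00" are kept as text.
            value, unit = None, reading
        sensors.append(Sensor(name, value, unit, status))
    return sensors


def unhealthy(sensors: list[Sensor]) -> list[Sensor]:
    """Sensors that returned a number but whose status is not 'ok'."""
    return [s for s in sensors if s.value is not None and s.status != "ok"]


# Illustrative output only, not captured from a real machine.
SAMPLE = """\
CPU Temp         | 45 degrees C      | ok
FAN 1            | 4800 RPM          | ok
FAN 2            | 0 RPM             | cr
PSU2 Status      | no reading        | ns
"""

if __name__ == "__main__":
    for sensor in unhealthy(parse_sdr(SAMPLE)):
        print(f"{sensor.name}: {sensor.value} {sensor.unit} ({sensor.status})")
```

The same approach works for the other things mentioned above: `ipmitool sel list` for the hardware event log, `ipmitool sol activate` for the serial console, and `ipmitool chassis power cycle` for power control.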
So, I’ve given a long list of the reasons why hardware sucks. Now, what do we do about it? I would like to see hardware diagnostics become a standard part of the hardware management tool set. Right now, each manufacturer has its own method for hardware diagnostics. Some put a separate partition on the disk. Some give you a CD to boot from. Some make it a feature of the fancy remote management card. In order to be able to diagnose hardware at scale, there needs to be an API.
I would like to see hardware diagnostics as ubiquitous as IPMI. Diagnosing server hardware problems needs to be as easy to automate as power cycling a system. S.M.A.R.T. is a decent idea in this area. All the hard drive manufacturers have come together to provide a unified API to access this information from the hard drive. Tools have been written to read this info cross platform as well as cross operating system. Hardware diagnostics might even fit well as an extension to IPMI.
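S.M.A.R.T. shows what this looks like when it works. One such tool is `smartctl` from smartmontools, and the sketch below asks it for a drive's overall health verdict with `smartctl -H`. The two patterns cover the wording smartctl uses for ATA and SCSI drives as far as I know it; the device path is only an example, and a real check would look at individual attributes as well.

```python
import re
import subprocess
from typing import Optional

# Health lines as printed by `smartctl -H` for ATA and SCSI devices.
_ATA_HEALTH = re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)")
_SCSI_HEALTH = re.compile(r"SMART Health Status:\s*(\S+)")


def smart_health(output: str) -> Optional[bool]:
    """Return True if the drive reports healthy, False if not, None if unknown."""
    match = _ATA_HEALTH.search(output)
    if match:
        return match.group(1) == "PASSED"
    match = _SCSI_HEALTH.search(output)
    if match:
        return match.group(1) == "OK"
    return None  # no verdict found (SMART unsupported, permission error, ...)


def check_device(device: str) -> Optional[bool]:
    """Run smartctl against one device, e.g. check_device('/dev/sda')."""
    # smartctl encodes findings in its exit status, so a non-zero exit is not
    # treated as an error here; the verdict is read from the text instead.
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return smart_health(result.stdout)
```

Because every drive answers the same question the same way, this one function covers a whole fleet regardless of who made the disks. That is the property I want for the rest of the server.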
Operating hardware will always be a very manual process, even at a large scale. We need to strive to automate as much of this operation as possible. Hardware manufacturers have been heading in this direction. As sysadmins, we need to continue to push hardware manufacturers to include standards-based diagnostics.
New LISA Blog Entry: ZFS: a Filesystem for Modern Hardware
Posted by Charles Wimmer in LISA 2009, OpenSolaris, Solaris, Solaris 10, System Administration on November 3, 2009
I just published a blog entry on the Usenix blog for LISA 2009: “ZFS: A Filesystem for Modern Hardware at LISA 2009”.
Where is the Sysadmin in the Brave New World of “Cloud Computing”?
Posted by Charles Wimmer in System Administration on October 31, 2009
Cloud this, Cloud that. It seems all you hear these days about large scale computing is Cloud. If you believe the marketing droids, you’ll never need to buy another piece of hardware. Just rent it from your friendly neighborhood cloud vendor.
It used to be that if you had a ‘hard’ systems problem to solve, you would call upon your system administrator. They would help you analyze your problem and find bottlenecks. They would find the weak link in your system and propose a solution. They would be able to lead you down the path of the best cost/performance ratio solution. They would help you deploy, troubleshoot, and operate it in your data center.
Working with cloud resources, the sysadmin’s job gets both simpler and more difficult. It becomes simpler because it is easier to plan, provision and decommission systems. All of these tasks are literally the click of a button.
The difficulty lies between the provisioning and decommissioning of the systems. Many of the sysadmins’ traditional tools are no longer available. For example, if there seems to be a problem somewhere between the application and database layer, the sysadmin could focus on the network. What is the topology between these hosts? What’s the bandwidth? Is there packet loss? Is there a capacity problem? What does the traffic look like from a SPAN port in between the hosts? If there was a performance problem with a database, they could look at the SAN for the source of the trouble. I could go on and on with examples.
Without access to traditional tools, sysadmins are somewhat hamstrung. Sometimes the only answer will be, “buy a bigger cloud”.
Don’t get me wrong. I am happy to see bandwidth, storage, and compute resources become commodities. It is good for business to be able to purchase and pay for what you use, when you use it. My concern is that we aren’t as frugal with our virtual resources as we are with our physical resources. It is my opinion that *because* the cost of entry is so low, we *must* be more careful when provisioning virtual resources. If we don’t take the time to plan and architect systems properly, cloud computing will likely cost more in the long run.
I think it is the responsibility of today’s sysadmins to make this point to the decision makers in the business. I think it is the responsibility of today’s sysadmins to learn the tools needed to be effective in this pursuit. What are those tools? How can we make those points? I want to explore these questions in future entries.