Spaced Out


Connecting the World

 

April 22nd 2002

One Good, Two Better

 

If you have one computer, you can do a lot: word processing, spreadsheets, web surfing, email, music, games, videos, file swapping, and a host of esoteric jobs. The computer is useful, educational and entertaining, until a friend or family member wants to use it too, at the same time as you do. Then you wish you had two, or maybe more.

 

However, how many computers does one person need, or rather, how many can one person use? One should be enough, but is two any better? It depends on what you want to do. For normal everyday chores one is best; two can be quite confusing. Put two computers side by side, and suddenly there are two different filing systems and two different application environments. Work done on one computer is missing from the other. So you need to network them, map file systems and do a lot of system administration to make them work like one machine. Believe me, it is no fun!

 

As the affluence and needs of home computer users grow, many families are opting to possess multiple computers. The multiplicity of computers is intended to support a multiplicity of users. Home networking systems are cropping up that allow multiple machines to access the Internet at the same time. Even file systems can be put together such that almost all files are accessible from almost all machines (note the disturbing “almost”). However, there are situations when one person may want to use more than one machine.

 

The use of multiple computers by a single person or by a single application was first suggested about 25 years ago, and was dubbed “distributed computing”. Initially it was just an idea, an idea that could not quite be justified in reality. Yet researchers felt attracted to it and started designing applications and algorithms that worked on multiple machines. The goal was to orchestrate the behaviors of many machines so that they did a useful task together, somewhat like teamwork done by people. Soon it was apparent that such coordination between computers, however interesting, was quite difficult to achieve in real life.

 

Of the many reasons expounded for harnessing the power of multiple machines, one idea stood out. If one computer takes an hour to do a task, two should get it done in half the time. Maybe one computer is better suited to filing and the other to computing; putting the two together would give us a machine that excels at both tasks. Wonderful as it may sound, the use of multiple computers has turned out to be daunting at best.

 

Home computer users perform interactive tasks; that is, the computer and the user interact heavily via the keyboard and the mouse. Hence one person using two machines would have to physically move between them, entering keystrokes and mouse clicks on each. That is neither comfortable nor attractive, so interactive computing is best done on one machine at a time.

 

The situation is different in other applications of computers. The majority of the computing done by industrial, scientific and data processing centers tends to be non-interactive. That is, the lonely computer sits in a lonely room all by itself, busily crunching huge amounts of data with no human intervention. This is where multiple computers are often used. They are supposed to work together, keep each other company and get the work done quicker. The humans hang around praying that the machines do not gang up and foul things up big time.

 

For example, as weather forecasting systems became more prevalent and complex, it was soon noted that producing an accurate region-by-region prediction for a large country takes several days of computing on one computer. Thus the prediction for Monday would get computed by Tuesday. Since such predictions are useless, we need multiple computers. Data corresponding to different regions are pumped into different computers, and the predictions appear on time.
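
 

For the programmers in the audience, here is a minimal sketch in Python of that idea. The region names and the forecast_region routine are invented for illustration, with the days of number crunching left out:

    from multiprocessing import Pool

    def forecast_region(region):
        # A stand-in for days of number crunching on one region's data.
        return "forecast for the " + region + " region"

    if __name__ == "__main__":
        regions = ["Northern", "Southern", "Eastern", "Western"]
        with Pool(processes=4) as pool:       # one worker process per region
            forecasts = pool.map(forecast_region, regions)
        print(forecasts)                      # all four regions, ready together

Each region is handed to its own worker, and nobody has to wait for anybody else until the very end.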

 

The above is a particular case of distributed computing called parallel processing. Parallel processing is like what we would do if we had to build a long brick wall. With one bricklayer working on the wall, it would take days to complete the structure. But we could get two workers, ask one to do the left side and the other to do the right side, and the job gets done quicker. However, there is a problem. While the workers are many feet apart, they stay out of each other’s turf, but as they get to the middle things do not work out well. The bricks may not merge smoothly, they may be at different levels, and the wall looks discontinuous. Such boundary discontinuities are just an eyesore in walls, but totally unacceptable in data processing. Hence computer programmers invented a gamut of techniques, called synchronization, that ensure the machines fall into step when they process data adjacent to each other.
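
 

One such technique is a barrier: each worker stops at an agreed point and waits for the other before touching the middle. A toy Python sketch, with the wall reduced to a couple of print statements:

    import threading

    barrier = threading.Barrier(2)            # both workers must check in here

    def build_half(side):
        print(side + " worker: my half of the wall is up")
        barrier.wait()                        # pause until the other half is up too
        print(side + " worker: heights agree, laying the middle bricks")

    for side in ("left", "right"):
        threading.Thread(target=build_half, args=(side,)).start()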

 

The synchronization requirements depend on how the wall is built. As we said, it seems obvious that two workers should start at the two ends. Such a technique is called coarse-grained parallelism. In coarse-grained parallelism the computers stay out of each other’s way most of the time. In the case of weather forecasting, we can make one computer do the eastern part and one do the western part. But sometimes, due to dependencies in the data, the computers need to work closely together. This is akin to two bricklayers working on the same section of the wall. As one worker lays a brick, the other worker places the next one beside (or on top of) it. Then the first worker lays the next one, and they proceed in lock step. Adding more workers speeds it up even more, but coordination gets harder. This sort of computing is called fine-grained parallelism.
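
 

Here is a toy Python sketch of the lock-step version, with two threads standing in for the bricklayers and a pair of semaphores passing the turn back and forth (the wall is just a list of labels):

    import threading

    wall = []                                  # the shared wall
    left_turn  = threading.Semaphore(1)        # the left worker starts
    right_turn = threading.Semaphore(0)

    def bricklayer(name, my_turn, next_turn, bricks):
        for i in range(bricks):
            my_turn.acquire()                  # wait for my turn
            wall.append(name + " brick " + str(i))   # lay exactly one brick
            next_turn.release()                # hand over to the other worker

    left  = threading.Thread(target=bricklayer, args=("left",  left_turn,  right_turn, 4))
    right = threading.Thread(target=bricklayer, args=("right", right_turn, left_turn,  4))
    left.start(); right.start()
    left.join();  right.join()
    print(wall)        # bricks alternate strictly: left, right, left, right, ...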

 

Fine-grained parallelism is elegant, and watching the wall grow really fast as the workers work in unison is a thing of beauty. But such lock-step programs are awkward to write, and while they run, subtle things go wrong. One misstep or one delay can start to gum up the works. The errors can multiply, and strange behaviors called “race conditions” begin to happen. Due to the complexity of fine-grained parallelism, academics study the problems that require such computing; practitioners just bite the bullet and run them on a single computer.
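
 

A race condition can be shown in a few lines of Python. Two threads each try to add 100,000 to a shared counter, but because reading the old value and writing the new one are separate steps, updates can get lost when the threads interleave; the exact shortfall varies from run to run:

    import threading

    counter = 0                               # shared by both threads

    def add_many():
        global counter
        for _ in range(100_000):
            value = counter                   # read the current total...
            counter = value + 1               # ...then write it back; the other
                                              # thread may have updated counter
                                              # in between, and its work is lost

    threads = [threading.Thread(target=add_many) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                            # frequently less than 200,000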

 

Multiple computers are used not only for parallel computing but also for managing failures. Simply put, if we have two computers, we can expect that if one fails, the other one will keep working, especially if the machines are physically separated, located in different cities and connected to different power supplies. The ability of multiple machines to keep running a program when one of them has failed is called “fault tolerance”.
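
 

A very small taste of the idea in Python: a program that needs some data tries a primary machine first and quietly falls back to a backup if the primary does not answer. The addresses are, of course, made up:

    import urllib.request

    REPLICAS = [
        "http://primary.example.com/data",    # hypothetical primary machine
        "http://backup.example.com/data",     # hypothetical backup in another city
    ]

    def fetch():
        for url in REPLICAS:
            try:
                with urllib.request.urlopen(url, timeout=5) as reply:
                    return reply.read()       # the first machine that answers wins
            except OSError:
                continue                      # this machine is down; try the next one
        raise RuntimeError("every replica has failed")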

 

Computers not only fail; contrary to popular belief, they make mistakes too. We see errant behaviors all the time, especially when we use computer networks. Email sometimes disappears; a web page that should load does not (the second attempt may work). In critical computing applications such errors are intolerable, hence the need for fault tolerance.

 

Suppose we have handwritten a long manuscript and want it typed into a word processor. If we ask one typist to do it, there will be many typos. If we get three typists to do it, there will still be typos, but an error made by one person will not be made by the others; the errors are expected to be in different places. Hence, in theory, if we meticulously compare the different files, we can detect the errors (an error is where one document differs from the other two).
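
 

The comparison itself is easy to mechanize. A toy Python sketch, assuming the three typed copies have the same length and that no two typists slip up in the same place:

    def majority(copy1, copy2, copy3):
        # Keep, at each position, whichever character at least two copies agree on.
        merged = []
        for a, b, c in zip(copy1, copy2, copy3):
            merged.append(a if a == b or a == c else b)
        return "".join(merged)

    # One typo in each copy, each in a different place:
    print(majority("thw cat sat", "the cat sst", "tke cat sat"))   # prints "the cat sat"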

 

The above situation is not as foolproof as it may seem. Who is going to merge the documents? If one person (or computer) merges them, what about errors made during the merge? These will go undetected. Maybe we should use three computers to merge the documents, then use three computers to check the merged documents, and keep repeating till the documents are identical. But in the end, who decides that the job is done? Such iterative processing keeps reducing the chance of errors, but cannot completely eliminate it.

 

Researchers have found the problem of error-free computing rather challenging. Many solutions exist, but none are perfect. Giving up is not an option; our current lifestyles demand that many critical systems work all the time, perfectly: air traffic control systems, international banking systems, space exploration systems and national defense systems, to name a few.

 

At the other end of the distributed computing spectrum is the magic of the World Wide Web. Web sites scattered all over the world seem to work in haphazard unison, delivering web pages to the browser. Search engines link up the web, and a user can jump between continents in a matter of clicks. Mail keeps whizzing around via gateways, relays and sorting stations, from and to mailboxes. We have not yet attempted to use fault-tolerance techniques in the web; at present that sounds much too complicated. Yet the day may not be too far away when technology ensures that when you click on what you want, you get what you want, on the first click.

 

Partha Dasgupta is on the faculty of the Computer Science and Engineering Department at Arizona State University in Tempe. His specializations are in the areas of Operating Systems, Cryptography and Networking. His homepage is at http://cactus.eas.asu.edu/partha