Failures in a Distributed System Case
Author: sagundrum • April 14, 2015 • Term Paper
Failures in a Distributed System
Stephen Gundrum
POS/355
September 9th, 2013
Caleb Green
Failures in hardware and software cannot be avoided, but steps can be taken to minimize their impact. No hardware currently available is guaranteed to run without failing at some point in its life span. There are many different types of failures, but I will discuss four of them and explain which ones are applicable to a centralized distributed system.
The first failure is called a transaction failure. A transaction failure is caused by either an application software error or a system error. An application error results when a transaction can no longer continue because of a logical error in the application software that is used to access the database. The logical error can be bad input, data not found, an overflow, or a resource limit being exceeded.
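To make the idea of a transaction failure concrete, the following is a minimal sketch in Python using the standard sqlite3 module. The accounts table, the make_db and balances helpers, and the transfer function are hypothetical names invented for this illustration and are not taken from any particular system. The sketch assumes that a logical error (bad input, data not found, or a limit exceeded) should abort the transaction and leave the database unchanged.

```python
import sqlite3


def make_db():
    """Create a small in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction.

    Returns True if the transaction committed, False if a logical
    error forced it to roll back.
    """
    try:
        # The connection used as a context manager commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            if amount <= 0:
                raise ValueError("bad input: amount must be positive")
            row = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if row is None:
                raise LookupError("data not found: source account")
            if row[0] < amount:
                raise ValueError("limit exceeded: insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            credited = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            if credited.rowcount == 0:
                # The debit above has already run; the rollback undoes it.
                raise LookupError("data not found: destination account")
        return True
    except (ValueError, LookupError):
        return False
```

In this sketch, a transfer to an account that does not exist fails after the source account has already been debited, and the rollback restores the original balance. That is the behavior a database is expected to provide when a transaction cannot continue.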
The second failure is a system crash. A system crash is a hardware malfunction or a software bug that brings the system down and causes it to lose the contents of main memory. The crash can force the computer into an error state or force a restart of the system. A restart of the computer will usually resolve these issues.
The third failure is a disk failure. This occurs when a hard disk drive fails. The failure can be mechanical, where the drive motor stops working or the read/write heads touch the platters within the hard drive. It can also be electrical. All hard drives have a circuit board attached to them, and any component on that board can fail. This shows up either as a complete failure of the hard disk drive or as read/write errors.
The fourth failure is a transient failure. A transient failure is one that affects a piece of the system for just a short period of time. These failures are usually caused by bugs in the program code. Most bugs in programs are caught before the finished program is released to the public. The bugs that remain are usually the ones that are difficult to locate, and they show up as resource leaks and environment-dependent bugs in the OS and the application (Argyros, 2012). Sometimes these bugs will crash the machine at the most inopportune time and cause distress among the users. The nice thing about this type of failure is that a reboot of the system usually clears the immediate problem. The bad thing is that any transactions that had not been completed are going to be lost. The underlying bug can usually only be fixed permanently by updates and patches from the program vendor.
Now that we know some of the failures that can occur, we need to look at how to diagnose where a failure occurred, why it occurred, and what the best way is to resolve the issue so that it does not recur.
Failures with hardware are usually simple to diagnose and repair. Hard drives are usually set up in a RAID configuration when they are installed in a server. Depending on the RAID configuration, recovering the data when a hard drive is replaced can be simple or difficult. Four RAID configurations are commonly used. RAID level 0 is used for system performance. Data is striped across two or more hard drives with no redundancy, so the loss of one drive means the loss of the whole array. It is not recommended to use this configuration in a critical system. RAID level 1 is considered mirroring. What this means is that one hard drive is a mirror of another. If one fails, the other picks up the slack with no loss of data. RAID level 5 requires at least three disks. This level gives good performance and redundancy. It is usually used for database applications where there is a lot of reading going on, because writing back to the database is going to be slow. RAID level 10 is also known as RAID level 1+0. This is because the hard drives are mirrored in pairs and the data is then striped across those mirrored pairs. This configuration requires at least four hard drives (Natarajan, 2010).
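The redundancy in RAID level 5 comes from parity information that is stored alongside the data. The following Python sketch shows, under simplifying assumptions, how the contents of one failed drive can be rebuilt from the surviving drives. It assumes a single stripe with two data blocks and one XOR parity block. Real RAID 5 controllers rotate the parity across all of the disks and work on fixed-size stripes, which is not modeled here. The function names xor_blocks and rebuild are hypothetical and chosen only for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(
        reduce(lambda a, b: a ^ b, column) for column in zip(*blocks)
    )


def rebuild(surviving_blocks):
    """Recover the one missing block of a stripe.

    XOR-ing every surviving block (data and parity) yields the block
    that was stored on the failed drive.
    """
    return xor_blocks(surviving_blocks)


# One stripe spread over three drives: two data blocks and one parity block.
disk0 = b"ACCT"
disk1 = b"DATA"
parity = xor_blocks([disk0, disk1])

# Simulate losing disk0 and rebuilding it from the two surviving drives.
recovered = rebuild([disk1, parity])
```

The same call rebuilds any single lost block, whether it held data or parity. This is why a RAID 5 array survives the loss of one drive but not two.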
...