A Generalized Model for Distributed Comparison-Based System-Level Diagnosis

Luiz C. P. Albini, Elias P. Duarte Jr., Roverli P. Ziwich
Journal of the Brazilian Computer Society (JBCS'2005),
ISSN 0104-6500, Vol. 10, No. 3, pp. 44-56, Apr, 2005.  [pdf]



Abstract

This work introduces a new system-level diagnosis model and an algorithm based on this model: Hi-Comp (Hierarchical Comparison-based Adaptive Distributed System-Level Diagnosis algorithm). This algorithm allows the diagnosis of systems that can be represented by a complete graph. Hi-Comp is the first diagnosis algorithm that is, at the same time, hierarchical, distributed and comparison-based. The algorithm is not limited to crash fault diagnosis, because its tests are based on comparisons. To perform a test, a processor sends a task to two processors of the system that, after executing the task, send their outputs back to the tester. The tester compares the two outputs; if the comparison produces a match, the tester considers the tested processors fault-free; on the other hand, if the comparison produces a mismatch, the tester considers that at least one of the two tested processors is faulty, but can not determine which one. Considering a system of N nodes, it is proved that the algorithm’s diagnosability is (N–1) and the latency is log2N testing rounds. Furthermore, a formal proof of the maximum number of tests required per testing round is presented, which can be O(N3). Simulation results are also presented.


References

[1]   A. Subbiah, and D.M. Blough, “Distributed Diagnosis in Dynamic Fault Environments,” IEEE Transactions on Paralel and Distributed Systems, Vol. 15 No. 5, pp. 453-467, 2004.

[2]   G. Masson, D. Blough, and G. Sullivan, “System Diagnosis,” Fault-Tolerant Computer System Design, ed. D.K. Pradhan, Prentice-Hall, 1996.

[3]   F. Preparata, G. Metze, and R.T. Chien, “On The Connection Assignment Problem of Diagnosable Systems,” IEEE Transactions on Electronic Computers, Vol. 16, pp. 848-854, 1968.

[4]   S.L. Hakimi, and A.T. Amin, “Characterization of Connection Assignments of Diagnosable Systems,” IEEE Transactions on Computers, Vol. 23, pp. 86-88, 1974.

[5]   S.L. Hakimi, and K. Nakajima, “On Adaptive System Diagnosis,” IEEE Transactions on Computers, Vol. 33, pp. 234-240, 1984.

[6]   S.H. Hosseini, J.G. Kuhl, and S.M. Reddy, “A Diagnosis Algorithm for Distributed Computing Systems with Failure and Repair,” IEEE Transactions on Computers, Vol. 33, pp. 223-233, 1984.

[7]   E.P. Duarte Jr., and T. Nanya, “A Hierarchical Adaptive Distributed System-Level Diagnosis Algorithm,” IEEE Transactions on Computers, Vol.47, pp. 34-45, 1998.

[8]   R.P. Bianchini, and R. Buskens, “Implementation of On-Line Distributed System-Level Diagnosis Theory,” IEEE Transactions on Computers, Vol. 41, pp. 616-626, 1992.

[9]   A. Brawerman, and E.P. Duarte Jr., “A Synchronous Testing Strategy for Hierarchical Adaptive Distributed System-Level Diagnosis,” Journal of Electronic Testing Theory and Applications, Vol. 17, No. 2, pp. 185-195, 2001.

[10] E.P. Duarte Jr., A. Brawerman, and L.C.P. Albini, “An Algorithm for Distributed Hierarchical Diagnosis of Dynamic Fault and Repair Events,” Proc. IEEE ICPADS’00, pp. 299-306, 2000.

[11] S. Lee, and K.G. Shin, “Probabilistic Diagnosis of Multiprocessor Systems,” ACM Computing Surveys, Vol. 26, No. 1, pp. 121-139, 1994.

[12] M. Malek, “A Comparison Connection Assignment for Diagnosis of Multiprocessor Systems,” Proc. Seventh Int’l Symp. Computer Architecture, pp. 31-36, 1980.

[13] K.Y. Chwa, and S.L. Hakimi, “Schemes for Fault-Tolerant Computing: A Comparison of Modularly Redundant and t-Diagnosable Systems,” Information and Control, Vol. 49, pp. 212-238, 1981.

[14] J. Maeng, and M. Malek, “A Comparison Connection Assignment for Self-Diagnosis of Multiprocessor Systems,” Digest 11th Int’l Symp. Fault Tolerant Computing, pp. 173-175, 1981.

[15] A. Sengupta, and A.T. Dahbura, “On Self-Diagnosable Multiprocessor Systems: Diagnosis by Comparison Approach,” IEEE Transactions on Computers, Vol. 41, No. 11, pp. 1386-1396, 1992.

[16] D.M. Blough, and H.W. Brown, “The Broadcast Comparison Model for On-Line Fault Diagnosis in Multicomputer Systems: Theory and Implementation,” IEEE Transactions on Computers, Vol. 48, pp. 470-493, 1999.

[17] D. Wang, “Diagnosability of Hipercubes and Enhanced Hypercubes under the Comparison Diagnosis Model,” IEEE Transactions on Computers, Vol. 48, No. 12, pp. 1369-1374, 1999.

[18] G.S. Almasi, and A. Gottlieb, Highly Parallel Computing, The Benjamim/Commings Publishing Company Inc., 1994.

[19] C. Xavier, and S.S. Iyengar, Introduction to Parallel Algorithms, Wiley-Intersciense Publication, 1998.

[20] N.F. Tzeng, and S. Wei, “Enhanced Hypercubes,” IEEE Transactions on Computers, Vol. 40, No. 3, pp. 284-294, Mar. 1991.

[21] T. Araki, and Y. Shibata, “Diagnosability of Butterfly Networks under the Comparison Approach,” IEICE Trans. Fundamentals, Vol

[22] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Mateo, CA, 1992.

[23] J. Fan, “Diagnosability of Crossed Cubes,” IEEE Transactions on Computers, Vol. 13, No. 10, pp. 1099-1104, Out. 2002.

[24] F. Harary, Graph Theory, Addison-Wesley Publishing Company, 1971.

[25] S. Rangarajan, A.T. Dahbura, and E.A. Ziegler, “A Distributed System-Level Diagnosis for Arbitrary Network Topologies,” IEEE Transactions on Computers, Vol. 44, No. 2, pp. 312-333, 1995.

[26] M.H. MacDougall, Simulating Computer Systems: Techniques and Tools, The MIT Press, Cambridge, MA, 1987.