Logic is a tech support person's greatest weapon. Used correctly, it can cut troubleshooting time dramatically. This time, logic just did not work right.
Warning: This is going to be a long one so forgive me if it becomes too technical. I'll try to keep it in English.
Ok, picture this: A tape library unit with two tape drives inside. Drive A is on the left, Drive B is on the right. You load the library with tapes and an internal robotic arm grabs the proper tape and puts it into one of the drives. When the tape is done the arm grabs the tape from the drive and puts it back in the library. This library unit is connected to a server machine with special backup software installed. There is a thick cable going from the back of the server into the back of the library and a cable going from the back of the library to Drive A and a third cable going from Drive A to Drive B. Got it? Good.
For some reason Drive A decides it doesn't want to work any more, so we call for service and a technician comes out and replaces it. The entire operation only takes about 30 minutes. He unhooks the cable from Drive A and plugs his laptop directly into it and runs a diagnostic process.
After about 20 minutes the diagnostic is complete and he reconnects Drive A and goes on his merry way. I check the connections and restart the server and start my own test process on Drive A. It fails after about 5 minutes. Yikes! This is a brand new drive and logic says that because the technician's diagnostic passed there is nothing wrong with the hardware. So I call tech support for the software and the battle begins.
The tech support person is nice, but nothing we try works and we're quickly running out of things to check. We get to the system event viewer and find error messages that point to a hardware failure. Logic says there is nothing wrong with the hardware. The software must be confused. Thus ends the help of the software tech.
So I set out on my own to use logic and track down the problem once and for all. I switch out one of the cables. Fail. I switch out another cable. Fail. I try a different tape. Fail. I switch out a third cable. Fail. I reboot the library. Fail. I reboot the server. Fail. I start switching cables around. Fail, fail, and fail. Now I'm getting desperate. Maybe the drive's internal ID numbers are messed up. Fail, fail, fail, fail, and fail. I pull out the power of the Internet and try tweaking some hardware settings. Success! But the drive is now so slow, it might as well be a fail.
That last success, plus logic, tells me the problem is one of three things: the new Drive A is bad, or the new cables are bad, or the port that the cable plugs into is bad. It's too late to do anything about it now, so I head home. I'll call the hardware technician and make him fix it.
The next morning the hardware technician comes back and replaces all the cables. Fail. I'm starting to see a pattern here. It's not the cables. They were tested and are fine. We switch the drives so that Drive A is now on the right and Drive B is on the left. Logic says if the drive is bad, Drive A will fail. If the port is bad, then Drive A will work and Drive B will fail. The test on Drive A works! AH HA! Logic says the port is bad! In your face, bad port!
The hardware guy suggests we go ahead and run a test on Drive B. It should fail and that will prove without a doubt that the port is bad. That test works... huh? What? Now wait a minute.
So it's not the port? What does logic have to say now? I run another test on Drive A. Success! I run another test on Drive B. Success! Ok, this is just silly. I suggest we swap the drives again and re-run the test. So now Drive A is back on the left and Drive B is back on the right. I run a test on Drive A. Success! What the? Stupid logic.
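For anyone who likes their logic written down, the swap test above boils down to a tiny decision table. This is only a sketch in Python: the function name `diagnose_swap` and the verdict strings are made up for illustration, and the two outcomes the plan never covered (both drives fail, both drives pass) are labeled as assumptions rather than anything the technician's procedure actually defines.

```python
def diagnose_swap(drive_a_passes: bool, drive_b_passes: bool) -> str:
    """Interpret test results after swapping Drive A and Drive B.

    After the swap, Drive A sits on the other (presumed good) port and
    Drive B sits on the port that Drive A was failing on.
    """
    if not drive_a_passes and drive_b_passes:
        # The fault followed the drive to its new position.
        return "drive A is bad"
    if drive_a_passes and not drive_b_passes:
        # The fault stayed with the original port.
        return "port is bad"
    if not drive_a_passes and not drive_b_passes:
        # Assumption: not covered by the original plan.
        return "inconclusive: both fail"
    # Assumption: not covered by the original plan either.
    return "inconclusive: fault not reproduced"


# What logic predicted after the first test (Drive A passes, Drive B expected to fail):
print(diagnose_swap(drive_a_passes=True, drive_b_passes=False))
# What actually happened:
print(diagnose_swap(drive_a_passes=True, drive_b_passes=True))
```

The last branch is the one this story ended up in, which is exactly why the table has no satisfying answer to offer.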
So that's how we fixed the dreaded Tape Library Drive of Doom™. I still don't know what was wrong.
16 years ago