System Administration

I received an error, now what?

We see them all the time: error messages. They are in our logs, in our monitoring and in our applications. We receive them so often we’ve become immune to them. We even have names for the alerts and errors we ignore, such as known, false or phantom. This is one of the biggest problems with software today. It’s also the main reason many companies haven’t yet implemented automation in their systems and networks.

Our admins must weed through hundreds of errors to find the real ones. If you are attempting to put automation in place, this problem is amplified. Automation breaks quickly when it encounters an error. Either a person will have to deal with the error, or the automation system will have to be changed to ignore or handle it. Ignoring errors opens up an entirely new list of dangers.

It’s time to start an error tracking project. Begin by assigning each error a ticket. No matter how small you believe the error is, treat it as important. Each error must be researched to find the cause. This sounds like a daunting task, and it is. You will find, however, that the task shrinks the farther you get into it, simply because a single problem can cause multiple errors. Once the problem is fixed, the entire group of errors goes away.
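To make the ticketing step less daunting, it can help to collapse repeated messages into one ticket per distinct error. What follows is only a minimal sketch of that idea, not a prescribed tool: the function names, the `ERR-` ticket prefix and the sample messages are all made up for illustration, and masking numbers is just one assumed way of deciding that two messages are "the same" error.

```python
import re
from collections import defaultdict


def signature(message: str) -> str:
    """Reduce a log message to a stable signature by masking variable parts.

    Assumption: hex values and numbers are the parts that vary between
    occurrences of the same error. Real logs may need more rules.
    """
    masked = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    masked = re.sub(r"\d+", "<n>", masked)
    return masked.strip().lower()


def group_errors(messages):
    """Group raw messages by signature so repeats land in the same bucket."""
    groups = defaultdict(list)
    for msg in messages:
        groups[signature(msg)].append(msg)
    return dict(groups)


def open_tickets(groups):
    """Create one hypothetical ticket record per distinct signature."""
    tickets = {}
    for number, sig in enumerate(sorted(groups), start=1):
        tickets[f"ERR-{number}"] = {"signature": sig, "count": len(groups[sig])}
    return tickets


# Hypothetical sample messages, not taken from any real system.
sample = [
    "Disk /dev/sda1 at 91% full",
    "Disk /dev/sda1 at 95% full",
    "Timeout connecting to 10.0.0.5",
]
tickets = open_tickets(group_errors(sample))
for ticket_id, info in tickets.items():
    print(ticket_id, info["count"], info["signature"])
```

Grouping like this only removes duplicates. Working out that several different tickets share one underlying cause is still research a person has to do.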

Stop using the terms “Known Error” or “False error/alert” in your department. These are buzzwords. Not only do they have no real meaning, they are hurting your department. An error ALWAYS means there is a problem, no exceptions. If the error shows up for no reason, then the notification showing up IS the problem and still should not be ignored. If an alert goes out at 1 AM because a system is down when it was scheduled to be down, that is a problem. You need to fix the way you’re handling downtime or train the person in charge of it.
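One way to fix the 1 AM example is to record scheduled downtime where the alerting logic can see it, so a planned outage never pages anyone and an unplanned one always does. The sketch below assumes a hypothetical `MaintenanceWindow` record and a `should_page` check. Neither comes from any particular monitoring product, and the host names and times are invented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    """A scheduled downtime period for one host (hypothetical structure)."""
    host: str
    start: datetime
    end: datetime


def should_page(host, alert_time, windows):
    """Return False when the alert falls inside a scheduled window for that host."""
    for window in windows:
        if window.host == host and window.start <= alert_time < window.end:
            return False
    return True


# Hypothetical schedule: db01 is down for planned work from midnight to 2 AM.
windows = [
    MaintenanceWindow("db01", datetime(2024, 1, 10, 0, 0), datetime(2024, 1, 10, 2, 0)),
]

print(should_page("db01", datetime(2024, 1, 10, 1, 0), windows))   # inside the window
print(should_page("db01", datetime(2024, 1, 10, 3, 0), windows))   # after the window
print(should_page("web01", datetime(2024, 1, 10, 1, 0), windows))  # different host
```

The point of the design is that the schedule is the single source of truth. If db01 is still down at 3 AM, after the window has closed, the page goes out, and that alert points to a real problem.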

Stop allowing vendors to tell you an error should be ignored. These are your vendors; they provide something for you. If their product reports an error, then there is an error and they need to be held responsible for fixing it. Make sure you open a ticket with them and hound them regularly about resolving it.

I’m not going to lie; this exercise will take a lot of time at first. You may have to set aside a month or two to get a handle on things. The rewards, however, more than make up for the work involved. In fact, I would argue it will take less time to fix these problems than it would to allow them to continue.

Consider for a moment how your network would be different if every error and every alert you received pointed to an actual problem. All the noise would be gone, and the environment would be quiet. No more missing an alert and having a user tell you a problem exists, or worse still, a customer leaving without you ever knowing why. No more small problems growing into larger ones because they were ignored. Think about the automation system which can now be put into place and, most importantly, trusted, because you always know what’s going on in your environment.

This is the path to good development, good administration and good testing, and the sooner you start working on this project the better. Too often we forget how important errors are and we ignore them. They are your environment telling you something is wrong and needs to be fixed. It’s time you gave them a voice.
