But how do you keep your inbox from exploding with status emails or just plain noise?
1: Ask yourself - do you need to know if things are healthy?
2: Can you put self-healing mechanisms in place for when things break?
3: Does your company have a corporate scheduling engine (BladeLogic, Control-M, or even a shared SQL Jobs box)?
Here are a few tips that I found useful:
- Send emails on failure only.
- Include reasonable self-healing steps in every script, or carve them out into a separate validation script.
- For example: auto cleanup for out-of-space issues, restarting IIS, killing off handles to a locked file, restarting a server, restarting a node in an AppFabric cluster, etc.
- Use the TEE command: pipe the console output of your scripts to a log. Any errors not explicitly handled by your scripts will end up at the console and may be missed completely.
- Your log naming convention should let you easily group and sort files.
- If you're using scheduled tasks, have a separate job checking the status of things. It should run under a different account, with a different password expiration date.
- Keep the email body short.
- Include the actual error in the email.
- Include the UNC path to the log (\\servername\logs\server_name\log_name.html) in the body of the email.
- Pass that same UNC path to the TEE command, so the log is written where the email points.
- Keep the subject of your emails informative and generic:
"%Servername% failed" is borderline useless as far as subject lines go.
- You can put all of your logs into a folder accessible through IIS (or Tomcat) and just enable directory browsing.
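
To make a few of these tips concrete, here is a rough sketch in Python (chosen only for illustration; the same idea applies in whatever language your jobs are written in). It tees a command's output to both the console and a log, and builds an email only when the command fails. Every name in it (`log_name`, `run_and_tee`, `failure_email`, the job and server names, the addresses) is hypothetical, and the subject format is just one possible choice, so treat it as a starting point rather than the way to do it.

```python
import subprocess
import sys
from datetime import datetime
from email.message import EmailMessage
from pathlib import Path


def log_name(log_dir, job, when):
    # Hypothetical convention: job name first, then a sortable timestamp,
    # so files group by job and sort by time.
    return Path(log_dir) / f"{job}_{when:%Y%m%d_%H%M%S}.log"


def run_and_tee(cmd, log_path):
    """Run cmd, copying combined stdout/stderr to the console and to log_path."""
    lines = []
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # unhandled errors land in the log too
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
            lines.append(line.rstrip("\n"))
        returncode = proc.wait()
    return returncode, lines


def failure_email(job, server, returncode, lines, log_ref, sender, recipient):
    """Return None on success, otherwise a short message with the error and log location."""
    if returncode == 0:
        return None  # failure-only: a healthy run sends nothing
    msg = EmailMessage()
    msg["Subject"] = f"{job} on {server} failed (exit code {returncode})"
    msg["From"] = sender
    msg["To"] = recipient
    tail = "\n".join(lines[-5:])  # keep the body short: last few lines only
    msg.set_content(f"Last output:\n{tail}\n\nFull log: {log_ref}\n")
    return msg
```

If `failure_email` returns a message, it could be handed to `smtplib.SMTP(host).send_message(msg)`, where `host` is whatever mail relay your environment provides. The `log_ref` argument is where the UNC path (or the IIS/Tomcat URL) of the log would go.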