Jun 23, 2013

Automation Feedback and making it useful.

We've been running tons of automation. Automation to deploy to a server. Automation to clean the logs. Automation to check the status. And even automation to check on other automation.

But how do you keep your inbox from exploding with status emails or just noise?

1: Ask yourself - do you need to know if things are healthy?
2: Can you have self-healing mechanisms in place if things are broken?
3: Does your company have a corporate scheduling engine (BladeLogic, Control-M, or even a shared SQL jobs box)?

Here are a few tips that I found useful:

  • Sending emails on failure only
  • Including reasonable self-healing steps in every script, or carving them out into a validation script.
    • For example: clean up automatically when disk space runs out, restart IIS, kill handles to a locked file, restart a server, or restart a node in an AppFabric cluster.
  • Use the TEE command to pipe the console output of your scripts to a log. Any errors not explicitly handled by your scripts will end up at the console, and without a log they may be missed completely.
  • The log naming convention should let you easily group and sort files.
  • If you're using scheduled tasks, have a separate job checking the status of things. It should run under a different account, with a different password expiration date, so that one expired password cannot silently stop both the jobs and the check.
  • Keep the email body short.
    • Include the actual error in the email.
    • Include the UNC path to the log (\\servername\logs\server_name\log_name.html) in the body of the email.
    • Include the UNC path to the TEE output as well.
    • Keep the subject of your emails informative and generic:
      "%Servername% failed" is borderline useless as far as subject lines go.
  • You can put all of your logs into a folder accessible through IIS (or Tomcat) and just enable directory browsing.
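
Several of the tips above (tee the output to a log, name logs so they sort, email only on failure, keep the body to the error plus a UNC path) can be sketched in a few lines. This is only a rough illustration in Python, not the scripts we actually run: the function names, the LOG_SHARE value, the log name pattern, and the subject format are all assumptions made up for the example.

```python
import subprocess
import sys
from datetime import datetime
from email.message import EmailMessage
from pathlib import Path

# Hypothetical share; it mirrors the \\servername\logs\... shape used above.
LOG_SHARE = r"\\servername\logs"


def log_name(job, server, started):
    """Build a name that groups by job, then server, and sorts by start time."""
    return f"{job}_{server}_{started:%Y%m%d-%H%M%S}.log"


def run_and_tee(cmd, log_path):
    """Run cmd, copying stdout and stderr to both the console and log_path.

    Returns (exit_code, last_non_empty_output_line).
    """
    last_line = ""
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # unhandled errors land in the same log
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # still visible on the console
            log.write(line)         # and kept in the log file
            if line.strip():
                last_line = line.strip()
        return proc.wait(), last_line


def failure_email(job, server, error, log_unc):
    """Short failure email: informative subject, the error, and the log path."""
    msg = EmailMessage()
    msg["Subject"] = f"[{job}] failed on {server}: {error[:60]}"
    msg.set_content(f"Error: {error}\nLog: {log_unc}\n")
    return msg


if __name__ == "__main__":
    job, server = "deploy", "web01"  # hypothetical job and server names
    name = log_name(job, server, datetime.now())
    log_dir = Path("logs")
    log_dir.mkdir(exist_ok=True)
    # Stand-in command; a real job would call the deployment script here.
    code, last = run_and_tee([sys.executable, "-c", "print('ok')"], log_dir / name)
    if code != 0:  # email on failure only; success stays quiet
        msg = failure_email(job, server, last, f"{LOG_SHARE}\\{server}\\{name}")
        print(msg)  # a real script would pass msg to smtplib.SMTP(...).send_message
```

Treating the last output line as "the actual error" is a shortcut for the sketch. A real script would more likely pick out the specific exception or error record it caught.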

These are just a few ideas, but you get the idea.

My favorite comic this week...