My notes from “Practical Monitoring” by Mike Julian

I finished this book earlier today and enjoyed it.

I wanted to write up some notes I took along the way, not as a book review or anything, just to help me remember some of the lessons learned.

  • The message that monitoring is not just for sysadmin/ops engineers is mentioned a couple of times in the book.
  • Focus on monitoring what is working, that is, what makes the app work, instead of broader metrics like CPU / memory.
  • Focus on the overall monitoring mission, and not just on the specific tool(s) in use at the moment.
  • Components of a monitoring service are: Data collection, Data storage, Visualization, Analytics and reporting, Alerting.
  • Use a SaaS tool for monitoring. It costs more than you think to develop in house. Unless you’re Google or Netflix, don’t do it.
  • For alerting, automate the solution and remove the alert if possible. If human action is needed, use runbooks to list out the options and steps to resolve.
  • Incident response management guidelines
  • Front-end monitoring is often overlooked, despite having an impact on revenue. Page load time can creep up over time, and that affects users’ happiness.
    • is worth looking at in this area. For example, it’s possible to measure the front-end performance impact of every pull request (look up WebPageTest private instances).
    • For APM, StatsD may be worth a look. It is Node.js based.
  • Monitoring deployments is often overlooked but worth doing, to help correlate deployments against increased error rates in an API, for example.
  • Look into distributed tracing. Ideal for microservice architectures.
  • Good info on the use of /proc/meminfo on page 94, related to server monitoring and how to read its output correctly, as well as grepping syslog for the OOM killer, whose presence means the kernel ran low on memory and killed a process to free some up.
  • iostat is good for disk stats, especially for seeing transfers per second (tps), also called IOPS.
  • Stop using SNMP: it is insecure and hard to extend. Opt for a push-based agent such as collectd or Telegraf.
  • For databases, keep an eye on slow queries and IOPS.
  • For queues, such as RabbitMQ, start by monitoring queue size and messages per second.
  • For caches, such as Redis, aim for a 100% hit rate. It is not always achievable, but worth aiming for.
  • Auditd is useful for monitoring user actions on a server. It can be told to monitor specific files too. Ideal for watching config files for changes. Use audisp-remote to send logs.
  • Security monitoring
    • Look into CloudPassage and Threat Stack.
    • Use rkhunter and use a cron entry to keep it updated daily. Set up alerts for warnings in its logs.
    • Look into Network Intrusion Detection Systems (NIDS) and network taps to analyse traffic for stuff that has gotten past the firewall.
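
On the /proc/meminfo note above: a common trap is to treat MemFree alone as the memory that is available, when Linux deliberately uses otherwise idle memory for buffers and page cache. The Python sketch below is only an illustration of that idea, not something taken from the book. It parses meminfo-style text and prefers the kernel's MemAvailable estimate (present since Linux 3.14), falling back to a rough free + buffers + cache sum. The function names and the sample values are made up for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(values):
    """Summarise memory use, preferring MemAvailable over MemFree."""
    total = values["MemTotal"]
    # Rough fallback for kernels without MemAvailable (an approximation only).
    fallback = values["MemFree"] + values.get("Buffers", 0) + values.get("Cached", 0)
    available = values.get("MemAvailable", fallback)
    return {
        "total_kb": total,
        "available_kb": available,
        "used_percent": round(100 * (total - available) / total, 1),
    }


# Hypothetical sample; on a real Linux host, read open("/proc/meminfo").read() instead.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    6000000 kB
Buffers:          200000 kB
Cached:          5000000 kB
"""

print(memory_summary(parse_meminfo(SAMPLE)))
```

In the sample, MemFree on its own would suggest the host is almost out of memory, while MemAvailable shows that most of it can still be reclaimed.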
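
For the cache hit rate mentioned above, Redis exposes `keyspace_hits` and `keyspace_misses` counters in the stats section of its `INFO` output. A minimal sketch of turning those into a ratio might look like the following. It assumes the `INFO` text has already been fetched (for example with `redis-cli INFO stats`); the function name and the sample numbers are hypothetical.

```python
def cache_hit_ratio(info_text):
    """Return hits / (hits + misses) from Redis INFO-style text, or None if no lookups yet."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Stats"
        key, sep, value = line.partition(":")
        if sep:
            stats[key] = value
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    lookups = hits + misses
    return hits / lookups if lookups else None


# Hypothetical sample in the "key:value" shape that INFO uses.
SAMPLE_INFO = "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n"

ratio = cache_hit_ratio(SAMPLE_INFO)
print(f"hit ratio: {ratio:.0%}")
```

These counters are cumulative since the server started (or since the stats were last reset), so for alerting it is usually more useful to compute the ratio from the difference between two samples than from the lifetime totals.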
