Is observability the sum of monitoring, logging, and tracing?
- Software reliability
- September 18, 2020
Cover image credit: David Clode on Unsplash
That Prometheus and Grafana got the top votes in CNCF’s End User Technology Radar on Observability comes as no surprise. However, not seeing Litmus or any other chaos engineering tool is a bit surprising, because these are listed in the Observability and Analysis category of the CNCF landscape.
Curiosity led me to watch a recording of one of the meetings hosted by the CNCF SIG for Observability. But questions still linger.
If I learn how to use these cool observability tools really well, will I have a firm handle on the state of any cloud-native system? Also, are chaos engineering tools, such as Litmus, ChaosKube, and PowerfulSeal, sufficient to help me determine the resiliency of such a system?
Is this a case of the tail wagging the dog?
Maybe, but who’s to say that’s not a feasible approach? What if the sum of the parts leads me to the whole? How would I recognize the whole?
As an observability noob, many such questions continue to baffle me.
- What should be the building blocks of my learning journey?
- How should I design my learning pathway?
- Why should I go down a certain path?
Last year, Charity Majors of honeycomb.io wrote a thought-provoking piece that debunked the myth about the three pillars of observability.
She spoke about instrumenting code and capturing details in a way that enables us to answer any question.
Here’s her tweet thread, which might bewilder even an observability veteran, let alone a newbie.
but: there are not "three pillars" of observabiity. there is only one data structure that underpins observability: the arbitrarily-wide structured data blob. consider:
— Charity Majors (@mipsytipsy) February 22, 2019
* metrics are trivially derived from it
* logs are a sloppy version of it
* traces are a visualization of it
My first stumbling block: Arbitrarily-wide structured data blobs
Do I really want to store data in three different ways? Probably not, but what exactly is the alternative she’s suggesting?
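Here is how I currently picture her claim, as a sketch. A single wide, structured event captures everything about one unit of work, and the "three pillars" fall out of it. All the field names and values below are invented for illustration; they don't come from any particular tool.

```python
import json

# One "arbitrarily-wide" structured event: a single blob capturing
# everything known about one unit of work (a hypothetical checkout
# request; every field here is made up for illustration).
event = {
    "timestamp": "2020-09-18T10:32:07Z",
    "service": "checkout",
    "trace_id": "4bf92f35",
    "span_id": "00f067aa",
    "http.method": "POST",
    "http.status_code": 503,
    "duration_ms": 842,
    "user_id": "u-1138",
    "error": "upstream timeout",
}

# Metrics are trivially derived from it: aggregate one field over many events.
events = [event, {**event, "http.status_code": 200, "duration_ms": 35}]
error_count = sum(1 for e in events if e["http.status_code"] >= 500)

# A log line is a "sloppy" version of it: a lossy, flattened projection.
log_line = f'{event["timestamp"]} {event["service"]} {event["error"]}'

# A trace is a visualization of it: the trace_id/span_id fields link
# events across services into one request's journey.
print(error_count, log_line, json.dumps(event["trace_id"]))
```

If this reading is right, the question becomes less "which of the three stores do I query?" and more "did I capture enough context in the event to begin with?"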
With so many moving parts in a distributed system, how can I even create a mental model of the system?
Based on what I’ve been reading and watching so far, tracing is something I can kinda grasp. At least, through the lens of understanding the flow of events or the journey of a service request. Monitoring and logging seem inapplicable to me when I don’t yet know what I don’t know. Also, how do I account for the silos created by decoupled metrics and lost context?
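That request-journey view of tracing is the part I can picture concretely: spans that share a trace ID and point at a parent span, reconstructing one request's path across services. A minimal sketch, with invented service names and timings:

```python
# A trace as a journey: spans sharing one trace_id, each pointing at a
# parent span. All IDs, services, and durations are made up.
spans = [
    {"span_id": "a", "parent": None, "service": "gateway",   "ms": 120},
    {"span_id": "b", "parent": "a",  "service": "checkout",  "ms": 95},
    {"span_id": "c", "parent": "b",  "service": "payments",  "ms": 60},
    {"span_id": "d", "parent": "b",  "service": "inventory", "ms": 20},
]

def render(parent=None, depth=0):
    """Reconstruct the request's journey as an indented call tree."""
    lines = []
    for s in spans:
        if s["parent"] == parent:
            lines.append("  " * depth + f'{s["service"]} ({s["ms"]} ms)')
            lines += render(s["span_id"], depth + 1)
    return lines

print("\n".join(render()))
# gateway (120 ms)
#   checkout (95 ms)
#     payments (60 ms)
#     inventory (20 ms)
```

Even this toy tree hints at why context matters: strip the parent links away and you're left with four disconnected measurements, which is roughly the silo problem.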
If monitoring helps us figure out known knowns, how can it help us answer unknown or unanticipated questions? And what about process variations? Do we consider common and special causes?
If I am unaware of what I am unaware of, how can I even begin to know what information I’d need to debug failures or glitches? With so many moving parts in a distributed system, will log aggregation help me spot a needle in a needle stack? Scalyr seems to think so, and here are two interesting blogs that showcase their claims:
- Log Metrics: 5 Things You Can Learn From Your Log Files
- Logging Best Practices: The 13 You Should Know
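The "metrics from log files" idea, as I understand it, is that an aggregator parses structure out of raw lines and computes signals over them. A toy version of that, using made-up access-log lines (this is my sketch of the concept, not how Scalyr itself works):

```python
import re
from collections import Counter

# Made-up access-log lines; a log aggregator would ingest millions of these.
logs = [
    '10.0.0.5 "GET /cart" 200 35ms',
    '10.0.0.9 "POST /checkout" 503 842ms',
    '10.0.0.5 "GET /cart" 200 31ms',
    '10.0.0.7 "POST /checkout" 503 910ms',
]

# Parse the HTTP status code out of each line and tally it: a metric
# derived from logs rather than emitted separately.
status = Counter()
for line in logs:
    m = re.search(r'"\w+ [^"]+" (\d{3})', line)
    if m:
        status[m.group(1)] += 1

error_rate = status["503"] / sum(status.values())
print(status, f"error rate: {error_rate:.0%}")
```

The needle-in-a-needle-stack question then becomes whether the parsed fields carry enough context to ask the question you didn't anticipate.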
Choosing the right tool for an unknown job. Where does one begin?
With so much talk about observability, SRE, and chaos engineering, I’m beginning to ponder the foundational elements of reliability and resiliency. While enthusiasm about the tools draws much attention, few are considering the skills, the game plan, and the worldview that lead to the choice of the right tool for the job.
It seems the primary need in this research is to be comfortable with volatility, uncertainty, complexity, and ambiguity (VUCA). I haven’t formed an opinion about monitoring and logging, and I might not do so anytime soon.
To understand resiliency, I’m looking to nature as an anchor for cues and inspiration to grok this labyrinth. I am designing and evaluating the efficacy of my learning journey by asking the what, why, and how of resilient, observable, distributed systems in nature.