I got OpenTelemetry to work. But why was it so complicated?
A number of our customers have recently been asking whether we support OpenTelemetry, the observability framework and toolkit. When the first client-facing member of the team asked me – the go-to guy for metrics and logging at IPF – I barely let them finish: “of course! That’s just Prometheus and Jaeger, right? We’ve supported that for years!”
Well, it turns out I was more or less wrong on all counts. Yay! Let’s get into it.
OTel?
As I mentioned up there, OpenTelemetry, or OTel, is an observability framework and toolkit. What does this mean? Well, to start, let’s think about the sort of tools that predated OTel in this space. A nice triad to get you started might be:
- Prometheus for metrics
- Things like Logstash for exporting logs to a central aggregator like Elasticsearch (other log exporters and aggregators are available)
- OpenTracing for distributed tracing
These are three different standards from three different organisations. OpenTelemetry tries to round up, and – more importantly – standardise the tooling in this space by codifying it under three so-called signals: metrics, logs and traces. In addition to just providing a specification, they have created:
- The OpenTelemetry Protocol (OTLP) which allows applications to report data on the aforementioned primitives to what is known as the…
- …OpenTelemetry Collector: a (and I quote) “Vendor-agnostic way to receive, process and export telemetry data.”
- Language SDKs for 10+ languages which implement OTLP and exporting of telemetry data
These are the core components they have created. There are others, but this will do for now. If you’re confused, here’s a diagram from their root docs page (which is in itself a tell!)
As we can see, “microservices” (remember those?) can report signals to the OTel Collector using auto-instrumentation (“zero code”), the API, or one of the language SDKs. But it’s not restricted to your applications: your infrastructure can also send signals to the OTel Collector!
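To make that concrete, here’s a minimal sketch of what wiring an application up to the Collector can look like with the Java SDK, assuming the Collector’s OTLP gRPC receiver is on the default localhost:4317 (the class, endpoint and service name are illustrative, not from any particular IPF setup – and in practice the zero-code agent can do all of this for you):

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtelBootstrap {
    public static OpenTelemetry init() {
        // Identify this service to the Collector (name is a made-up example)
        Resource resource = Resource.getDefault().merge(Resource.create(
            Attributes.of(AttributeKey.stringKey("service.name"), "payment-initiation")));
        // Batch spans in the background and ship them to the Collector over OTLP/gRPC
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .setResource(resource)
            .addSpanProcessor(BatchSpanProcessor.builder(
                OtlpGrpcSpanExporter.builder().setEndpoint("http://localhost:4317").build()).build())
            .build();
        // Register globally so instrumentation can find it via GlobalOpenTelemetry
        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .buildAndRegisterGlobal();
    }
}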
So this all sounds well and good, and indeed a noble cause. As a bonus it ticks all the buzzword boxes: open-source, vendor-agnostic, language-agnostic, distributed, zero-code (conditions apply; see below).
So what’s wrong?
No strangers to observability
We are not new to this space. We’ve had docs on these topics for a while now, and our customers have used these in production for some time. Want to get IPF to report to an ELK Stack? No problem. General docs on monitoring and observability? Sure thing!
As a framework, we never know what environment an app using our libraries is going to be deployed into. Therefore, we have to stay abstract and never make any vendor-specific choices one way or another. I’m starting to sound like the OTel people. This means we have to spend a lot of time building in overrideable sensible defaults, abstractions (ensuring they are not crappy and/or leaky ones), and so on.
Having said that, we can make recommendations: we usually recommend that logs are pushed to some aggregator like Elasticsearch, LogScale, etc, and by default we enable and support Prometheus metrics as they’d become a de facto standard. That is, until OTel came along.
Easy wins: logs and metrics
While logs and metrics are fairly well-known quantities amongst our customers, tracing is not. There must be something in the water, because we got three almost simultaneous requests asking about tracing with OTel. And if you’re using OTel for tracing, you might as well use it for the other two things as well!
It was fairly easy for our customers to move logs and metrics to OTel: enabling them is just a matter of adding a new log appender and some new config respectively. But what do we do for tracing?
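As an aside, if you also want to record custom metrics directly through the OTel API (rather than relying purely on auto-instrumentation), it’s only a few lines. A rough sketch – the meter and counter names are invented for illustration:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class PaymentMetrics {
    // Meter scoped to this (hypothetical) component
    private static final Meter METER = GlobalOpenTelemetry.get().getMeter("payment-initiation");
    // Counter that will be exported as an OTel metric rather than a Prometheus-only one
    private static final LongCounter INITIATED = METER
        .counterBuilder("payments.initiated")
        .setDescription("Number of payment flows initiated")
        .build();

    public static void recordInitiated() {
        INITIATED.add(1);
    }
}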
Tracing my headache
When you trace a transaction through a distributed system, you need to pass some information between the systems to correlate a specific invocation with a specific transaction. This is known as context propagation.
A trace is a parent/wrapper for a set of individual units of work, called spans. A trace can contain multiple spans, and spans can be nested. Imagine an e-commerce website: clicking the “buy now” button initiates a trace; the frontend talks to the backend in one span, the backend starts a new child span to talk to the payment/shipping/order management services, and each of those may start further child spans to talk to downstream systems further along. All the distributed services report their part of the trace to the OTel Collector (if using OTel), and OTel builds a holistic view under a single trace ID. Well, that’s the theory anyway.
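In OTel API terms, the “buy now” example might look something like the sketch below (tracer and span names are invented): each child picks up the current span as its parent, so everything ends up under one trace ID.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class Checkout {
    private static final Tracer TRACER = GlobalOpenTelemetry.get().getTracer("shop-frontend");

    public void buyNow() {
        // Root span of the trace: handling the "buy now" click
        Span checkout = TRACER.spanBuilder("checkout").startSpan();
        try (Scope ignored = checkout.makeCurrent()) {
            chargeCard();
        } finally {
            checkout.end();
        }
    }

    private void chargeCard() {
        // Because "checkout" is the current span, this automatically becomes its child
        Span payment = TRACER.spanBuilder("charge-card").startSpan();
        try (Scope ignored = payment.makeCurrent()) {
            // ...call the payment service here...
        } finally {
            payment.end();
        }
    }
}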
In typical fashion for our industry, given the opportunity to gather around one standard for context propagation, we decided to make several. Obligatory reference to the XKCD comic: Standards. The context propagation implementations that OTel MUST(!) support are:
- W3C Trace Context
- W3C Baggage
- B3 (Zipkin)
- Jaeger
I don’t really understand why there are four, but I think what’s happened is convergent evolution: different sets of people – unaware of one another – came to a shared conclusion regarding how to solve a specific problem. Honourable mention for W3C creating two standards for the same thing. Actually, if you read the Baggage spec, they claim that it’s independent of Trace Context. So why does OTel let me propagate the Trace Context over Baggage? I have no idea. Let me know in the comments. Does our blog have comments?
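If you configure the SDK by hand rather than via the agent, you choose which of these formats your app will understand and emit. Here’s a sketch of turning all four on at once – note that the B3 and Jaeger propagators live in the separate opentelemetry-extension-trace-propagation artifact:

import io.opentelemetry.api.baggage.propagation.W3CBaggagePropagator;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.context.propagation.TextMapPropagator;
import io.opentelemetry.extension.trace.propagation.B3Propagator;
import io.opentelemetry.extension.trace.propagation.JaegerPropagator;
import io.opentelemetry.sdk.OpenTelemetrySdk;

public class Propagation {
    public static OpenTelemetrySdk withAllFourPropagators() {
        // Composite propagator: inject and extract every supported header format
        TextMapPropagator all = TextMapPropagator.composite(
            W3CTraceContextPropagator.getInstance(),
            W3CBaggagePropagator.getInstance(),
            B3Propagator.injectingSingleHeader(),
            JaegerPropagator.getInstance());
        return OpenTelemetrySdk.builder()
            .setPropagators(ContextPropagators.create(all))
            .build();
    }
}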
Now, we already had support for the now-deprecated OpenTracing within the Akka parts of our app via Lightbend Telemetry’s OpenTracing support. Then it got better: in version 2.20.0 of that library, they introduced support for OpenTelemetry logs/events and metrics, but – crucially – not tracing! Theoretically this doesn’t matter, because OTel tracing is really just a rebranding of OpenTracing, and Lightbend Telemetry’s OpenTracing support handles context propagation using the four methods above anyway. So it should just work…right?
Of course not. We have a new fly in the ointment.
Clash of APIs
IPF makes use of both Spring and Akka for different purposes: we use Spring Boot – and Spring IoC in general – to bootstrap the application and its config, build a set of dependencies, and so on. We use Akka for pretty much everything else: event sourcing, scheduling, clustering, sharding, integration, and so on. But of course we can’t stop customers from doing whatever they want, and so some of them tend to use – for example – a Spring REST controller, or perhaps a @KafkaListener-annotated method to initiate payment flows.
Without OTel, up until now, this was fine:
- Metrics: Spring and Akka expose different Prometheus endpoints, so they won’t step on each other’s toes
- Logs: We use SLF4J and Logback, which are common to both frameworks
- Traces: Both used OpenTracing
When using OTel, metrics and logs still came through fine, precisely because Spring and Akka operate independently of one another. However, when it comes to tracing, we need them to be aware of one another so that they can propagate the trace context within the JVM. The problem now is that there are two different APIs being used inside the app to represent the same thing, and guess what: they aren’t talking to one another. Here’s a diagram if you’re struggling (I certainly am):
Why are there four different values? I was hoping for three! The correct behaviour would be for the Akka HTTP client to reuse the same trace ID 123 from the left-hand side, but make a new span ID to indicate that a new unit of work has started which is part of the same 123 trace. But instead, what we have are two completely different traces with no relation to one another.
The silver lining is that it looks like the OTel people were aware of this, and created opentracing-shim: a way for you to dress up an OTel Tracer as an OpenTracing Tracer. But this doesn’t work either, because Lightbend Telemetry uses a custom Tracer implementation which makes the shim and Jaeger freak out, and, instead of any of these things doing their job and propagating a trivial string-string map, spams the log with:
Expected to have an OpenTelemetry Span but got cinnamon.opentracing.TraceLocal$ContextOnlySpan
And in Jaeger’s case:
Expected to have a JaegerSpanContext but got io.opentelemetry.opentracingshim.SpanContextShim
So Jaeger is freaking out because it’s getting OTel, and OTel is freaking out because it’s getting Lightbend. Why is everybody so precious about their own implementations?! All to move around a map with four scalar values?
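For completeness, this is roughly how the shim is supposed to be wired up when it does work: wrap the global OTel instance and register the result as the process-wide OpenTracing tracer. A sketch only – as described above, it didn’t help here because Lightbend Telemetry insists on its own Tracer:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.opentracingshim.OpenTracingShim;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class ShimSetup {
    public static void register() {
        // Dress the OTel SDK up as an OpenTracing Tracer...
        Tracer shim = OpenTracingShim.createTracerShim(GlobalOpenTelemetry.get());
        // ...and make it the global OpenTracing tracer for anything using that API
        GlobalTracer.registerIfAbsent(shim);
    }
}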
Into the weeds
Both OTel and Lightbend Telemetry instrumentation use a Java Agent which hooks into specific method calls of specific classes and reports on their activity to the tracer. The Lightbend Telemetry instrumentation is not open source, so I had to do a bit of digging into decompiled tracer code to figure out what on earth is going on.
I built a cut-down reproducer which replicates the diagram above, in which I attempted to see where exactly between the green and red box the trace context is being dropped, and see if I could help it along manually.
The first thing I did was set up a breakpoint on the Jaeger freak-out point (here, if you want to play along) where it’s complaining that it’s getting a shim and not real OpenTracing. My thought process was: okay, this is close to what I want to happen (Jaeger to adopt the OTel context), so can I do something shim-free to help it along?
The caller that was causing it to freak out was OpenTracingAkkaPersistenceActorInstrumentation.class, a proprietary Lightbend Telemetry instrumentation class. I traced the Jaeger breakpoint to this point in the stack, which looks promising:
It was line 153 that calls into Jaeger and upsets it, but the first clue was on line 152: if var5 – an OpenTracing SpanContext – is not null, we attach our new span as a child of the current active span. If var5 is null, the span has no relationship to the existing trace, and if var5 is not a Jaeger SpanContext then Jaeger freaks out. So how can we make var5 something that is both non-null and Jaeger-friendly?
On line 147, the SpanContext is being populated from this.traceLocal.currentContext(). So can I access this from outside somehow, create a new Jaeger SpanContext myself, and stop using the shim?
I then breakpointed on the constructor of OpenTracingAkkaPersistenceActorInstrumentation to see how traceLocal is initialised. Behold another decompiled wonder:
Okay, var4 is an ExtendedTracer, and from my extensive trawling of the Lightbend Telemetry docs, I remembered they had written that it’s possible to access this ExtendedTracer programmatically using the following magical incantation:
GlobalExtendedTracer.get()
Things are so simple once you know what you’re doing!
This returns an ExtendedTracer, and indeed calling local() returns the same TraceLocal that is being used by the instrumentation library. I also noticed that this TraceLocal has an activateContext method which takes an OpenTracing SpanContext. So if I ditch the shim and manually convert the OpenTelemetry Context into a Jaeger SpanContext, then:
- Jaeger will be happy because it gets its own SpanContext implementation
- The shim will be happy because…it’s not being used
- I will be happy because the trace ID should be propagated (albeit manually) between OTel and OpenTracing
We can convert an OTel context into an OpenTracing one manually using the inject and extract operations from the propagators API: I tell OTel to populate (“inject”) a generic Java Map with the context, and I then populate (“extract”) the Jaeger SpanContext with the values of the map.
Here’s what it all looks like if you’re technically oriented (you must be if you’ve read all the way down here):
//make an empty map to populate
var otelContext = new HashMap<String, String>();
//inject OTel context into the map
GlobalOpenTelemetry.get().getPropagators().getTextMapPropagator().inject(Context.current(), otelContext, (carrier, key, value) -> carrier.put(key, value));
//populate new JaegerSpanContext from aforementioned map
var openTracingContext = new TextMapCodec(false).extract(new TextMapAdapter(otelContext));
//activate this new context before going into Akka
GlobalExtendedTracer.get().local().activateContext(openTracingContext);
//do stuff here - either OTel or Akka - because they should both be sharing the same context now
Okay, let’s spin it up, and:
Yes! We can see that it’s working correctly and ticking all the boxes:
- We have a single trace connecting everything from the initial Spring REST call to /submit
- We can mix code instrumented by the OTel and OpenTracing APIs
- The trace is propagated across an HTTP boundary (bottom two spans)
I gave the two different instrumentations two different names above to highlight that they are two kinds of tracing. But in real life we’d recommend to customers that they line up the otel.service.name and the cinnamon.application so that they appear as one application to the outside world.
Reflections
From looking at the decompiled Lightbend Telemetry library, I can see that rewriting it to use OTel APIs would be a big effort: they would have to port their many instrumentations – which are currently tied to the Jaeger API – to the OTel API. But the saving grace for them is that the OTel Collector supports collecting traces in the legacy Zipkin format, so they could ship complete OTel support by leaning on that legacy path, and they probably don’t have too many customers like us who hit this kind of clash.
To answer the titular question of why it was so complicated, I guess the short answer is that it’s self-inflicted: by trying to combine two different instrumentations which use two different tracing libraries, we’re kind of asking for trouble. I think the OTel project has done a good job of attempting to apply some standardisation in this space (see, for example, semantic conventions). It’s a little complex to wrap your head around at first, but I believe it’s a great FOSS project full of great people doing good things.
I am a little bit concerned about how or if Akka will correctly pass the trace context between threads inside its actor model. It seems to be working as expected from my small amount of load testing, but I have opened a ticket with them and can update this if there’s enough demand!
I hope that this will be of help to somebody. And if not, thanks for the free therapy session. Until next time!