Zeeshan Lakhani Cofounder and organizer of Papers We Love and PWLConf, applied researcher working on local-first software (among other things) at fission.codes, and PhD student at Carnegie Mellon University, studying Programming Language Theory.

Composing an observable Rust application


The observability of today’s complex, dynamic systems is predicated on domain knowledge and, even more so, on the unknown “unknowns” that arise from incomplete domain knowledge. In other words, the cases that fall between the cracks and surprise us, as illuminated in the quote below:

It took us no more than an hour to figure out how to restore the network; several additional hours were required because it took so long for us to gain control of the misbehaving IMPs and get them back to normal. A built-in software alarm system (assuming, of course, that it was not subject to false alarms) might have enabled us to restore the network more quickly, significantly reducing the duration of the outage. This is not to say that a better alarm and control system could ever be a replacement for careful study and design which attempts to properly distribute the utilization of important resources, but only that it is a necessary adjunct, to handle the cases that will inevitably fall between the cracks of even the most careful design – (Rosen, RFC 789)

At its heart, observability is how we expose a system’s behavior (in some disciplined way, hopefully) and make sense of such behavior.

In this article, we’ll discuss the importance of observability and we’ll investigate composing the underpinnings of an observable Rust application.

Why is observability important?

Our systems have become more complicated due to the proliferation of microservice deployments and orchestration engines, where major companies are running thousands of microservices, and even startups are operating in the hundreds.

The harsh reality of microservices is that they suddenly force every developer to become a cloud/distributed systems engineer, dealing with the complexity that is inherent in distributed systems. Specifically, partial failure, where the unavailability of one or more services can adversely impact the system in unknown ways. – (Meiklejohn, et. al., Service-Level Fault Injection Testing)

In these complex times, achieving observability goes a long way toward architecting, troubleshooting, and benchmarking our systems for the long haul. Delivering observability starts with gathering our output data (telemetry and instrumentation) from our running systems at the right level of abstraction, normally organized around a request path, so that we can then explore and dissect data patterns and find cross-correlations.

On paper, this sounds somewhat easy to achieve. We gather our three pillars — logs, metrics, and traces — and we’re done. However, these three pillars on their own are just bits, whereas gathering the most helpful kinds of bits and analyzing the collection of bits together holds the most complexity.

Forming the right abstractions is the hard part. It can be very domain-specific and relies on building a model for our systems’ behavior, which is open to change and prepared for surprises. It involves the developer having to get more involved with how events in their applications and systems are generated and diagnosed.

Throwing log statements everywhere and collecting every possible metric loses long-term value and causes other concerns. We need to expose and augment meaningful output, so that data correlation is even possible.

This is a Rust article after all, and while Rust was built with safety, speed, and efficiency in mind, exposing system behavior was not one of its founding principles.

How can we make Rust applications more observable?

Starting from first principles, how do we go about instrumenting code, collecting meaningful trace information, and deriving data to help us explore unknown “unknowns”? Everything is driven by events, and traces capture series of events/operations, including requests/responses, database reads/writes, and cache misses. So what’s the trick to going from nothing to something for a Rust application that has to communicate with the outside world in order to achieve end-to-end observability? What do the building blocks look like?

Sadly, there’s not just one trick or silver bullet here, especially when writing Rust services, which leaves a lot for the developer to piece together. First, the only thing we can truly rely on to understand and debug an unknown “unknown” is telemetry data, and we should make sure we surface meaningful, contextual telemetry data (e.g., correlatable fields like request_path, parent_span, trace_id, category, and subject). Second, we need a way to explore that output and correlate it across systems and services.

In this post, we’ll mainly be concerned with collecting and gathering meaningful, contextual output data, but we’ll also discuss how best to hook into platforms that provide further processing, analysis, and visualization. Luckily, the core tool(s) are out there for instrumenting Rust programs to collect structured event data and process and emit trace information, for async and sync communication alike.

We’ll focus on the most standard and flexible framework, tracing, which is situated around spans, events, and subscribers, and how we can take advantage of its composability and customizability.

However, even though we have an extensive framework like tracing at our disposal to help us write the underpinnings of observable services in Rust, meaningful telemetry doesn’t come “out of the box” or “fall out for free.”

Getting to the right abstraction in Rust is not as straightforward as it may be in other languages. Instead, a robust application must be built upon layered behaviors, all of which provide exemplary controls for developers in the know, but can be cumbersome for those more inexperienced.

We’re going to break up our problem space into a series of composable layers which function over four different units of behavior:

  • Storing contextual information for the future
  • Augmenting structured logs with contextual information
  • Deriving metrics through instrumentation and span durations
  • OpenTelemetry interoperability for distributed tracing

Akin to how the original QuickCheck paper on property-based testing relied on users to specify properties and provide instances for user-defined types, building end-to-end observable Rust services requires some understanding of how traces are generated, how data is specified and maintained, and what telemetry makes sense as applications grow. This is particularly true when it comes time to debug and/or explore inconsistencies, partial failures, and dubious performance characteristics.

Trace collection will drive everything in this post’s examples, where spans and events will be the lens through which we tie together a complete picture of known quantities. We’ll have logs, but we’ll treat them as structured events. We’ll gather metrics, but have them automated via instrumentation and spans, and we’ll export OpenTelemetry-compatible trace data for emission to a distributed tracing platform like Jaeger.

Spans, events, and traces

Before we get into the implementation details, let’s start with some terms and concepts we need to be familiar with, such as spans, traces, and events.


A span represents operations or segments that are part of a trace and acts as the primary building block of distributed tracing. For any given request, the initial span (without a parent) is called the root span. It typically represents the end-to-end latency of an entire user request within a given distributed trace.

There can also be subsequent child spans, which can be nested under other various parent spans. The total execution time of a span consists of the time spent in that span as well as the entire subtree represented by its children.

Here’s an example of a purposefully condensed parent span log for a new request:

level=INFO span name="HTTP request" span=9008298766368774 parent_span=9008298766368773 span_event=new_span timestamp=2022-10-30T22:30:28.798564Z http.client_ip= http.host= http.method=POST http.route=/songs trace_id=b2b32ad7414392aedde4177572b3fea3

This span log contains important pieces of information and metadata like the request path (http.route), timestamp (2022-10-30T22:30:28.798564Z), request method (http.method), and trace identifiers (span, parent_span, and trace_id respectively). We’ll use this information to demonstrate how a trace is tied together from start to completion.

Why is it called a span? Ben Sigelman, an author of Google’s Dapper tracing infrastructure paper, considered these factors in A Brief History of “The Span”: Hard to Love, Hard to Kill:

  • Within the code itself, the API feels like a timer
  • When one considers the trace as a directed graph, the data structure seems like a node or vertex
  • In the context of structured, multi-process logging (side note: at the end of the day, that’s what distributed tracing is), one might think of a span as two events
  • Given a simple timing diagram, it’s tempting to call the concept a duration or window


An event represents a single operation in time, where something occurred during the execution of some arbitrary program. In contrast to out-of-band, unstructured log records, we will treat events as a core unit of ingest happening within the context of a given span, and structured with key-value fields (akin to the span log above). More precisely, these are called span events:

level=INFO msg="finished processing vendor request" subject=vendor.response category=http.response vendor.status=200 vendor.response_headers="{\"content-type\": \"application/json\", \"vary\": \"Accept-Encoding, User-Agent\", \"transfer-encoding\": \"chunked\"}" vendor.url=http://localhost:8080/.well-known/jwks.json vendor.request_path=/.well-known/jwks.json target="application::middleware::logging" location="src/middleware/logging.rs:354" timestamp=2022-10-31T02:45:30.683888Z

Our application can also have arbitrary structured log events that take place outside of a span context. For example, to display configuration settings at startup or monitor when a cache is flushed.


A trace is a collection of spans that represent some workflow, like a server request or queue/stream processing steps for an item. Essentially, a trace is a Directed Acyclic Graph of spans, where the edges connecting spans indicate causal relationships between spans and their parent spans.

Here’s an example of a trace visualized within Jaeger UI:

Rust Trace Visualized with Jaeger UI

If this application were part of a larger, distributed trace, we’d see it nested within a larger parent span.

Now, with the terms out of the way, how do we even get started implementing the skeleton of an observability-ready Rust application?

Composing multiple tracing layers to build a subscriber

The tracing framework is split up into different components (as crates). We’ll focus on this set of .toml dependencies for our purposes:

opentelemetry = { version = "0.17", features = ["rt-tokio", "trace"] }
opentelemetry-otlp = { version = "0.10", features = ["metrics", "tokio", "tonic", "tonic-build", "prost", "tls", "tls-roots"], default-features = false}
opentelemetry-semantic-conventions = "0.9"
tracing = "0.1"
tracing-appender = "0.2"
tracing-opentelemetry = "0.17"
tracing-subscriber = {version = "0.3", features = ["env-filter", "json", "registry"]}

The tracing_subscriber crate gives us the ability to compose tracing subscribers from smaller units of behavior, called layers, for collecting and augmenting trace data.

The Subscriber itself is responsible for registering new spans when created (with a span id), recording and attaching field values and follow-from annotations to spans, and filtering out spans and events.

When composed with a subscriber, layers tap into hooks triggered throughout a span’s lifecycle:

fn on_new_span(&self, _attrs: &Attributes<'_>, _id: &span::Id, _ctx: Context<'_, S>) { ... }
fn on_record(&self, _span: &Id, _values: &Record<'_>, _ctx: Context<'_, S>) { ... }
fn on_follows_from(&self, _span: &Id, _follows: &Id, _ctx: Context<'_, S>) { ... }
fn event_enabled(&self, _event: &Event<'_>, _ctx: Context<'_, S>) -> bool { ... }
fn on_event(&self, _event: &Event<'_>, _ctx: Context<'_, S>) { ... }
fn on_enter(&self, _id: &Id, _ctx: Context<'_, S>) { ... }
fn on_exit(&self, _id: &Id, _ctx: Context<'_, S>) { ... }
fn on_close(&self, _id: Id, _ctx: Context<'_, S>) { ... }

How are layers composed in code? Let’s start with a setup method, producing a registry that’s defined with four with combinators, or layers:

fn setup_tracing(
    writer: tracing_appender::non_blocking::NonBlocking,
    settings_otel: &Otel,
) -> Result<()> {
    let tracer = init_tracer(settings_otel)?;
    // sketch: compose the four layers discussed in this article
    tracing_subscriber::Registry::default()
        .with(StorageLayer.with_filter(LevelFilter::TRACE))
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .with(MetricsLayer)
        .with(LogFmtLayer::new(writer))
        .init();
    Ok(())
}

The setup_tracing function would typically be called upon initialization of a server’s main() method in main.rs. The storage layer itself provides zero output and instead acts as an information store for gathering contextual trace information to augment and extend downstream output for the other layers in the pipeline.

The with_filter method controls which spans and events are enabled for this layer, and we’ll want to capture essentially everything, as LevelFilter::TRACE is the most verbose option.

Let’s examine each layer and see how each operates over trace data collected and hooks into the span lifecycle. Customizing each layer’s behavior involves implementing lifecycle hooks associated with the Layer trait, like on_new_span below.

Along the way, we’ll demonstrate how these units of behavior augment span- and event-log formatting, derive some metrics automatically, and send what we’ve gathered to downstream, distributed-tracing platforms like Jaeger, Honeycomb, or Datadog. We’ll start with our StorageLayer, which provides contextual information from which other layers can benefit.

Storing contextual information for the future

On new span

impl<S> Layer<S> for StorageLayer
where
    S: Subscriber + for<'span> LookupSpan<'span>,
{
    fn on_new_span(&self, attrs: &Attributes<'_>, id: &Id, ctx: Context<'_, S>) {
        let span = ctx.span(id).expect("Span not found");
        // We want to inherit the fields from the parent span, if there is one.
        let mut visitor = if let Some(parent_span) = span.parent() {
            let mut extensions = parent_span.extensions_mut();
            let mut inner = extensions
                .get_mut::<Storage>()
                .map(|v| v.to_owned())
                .unwrap_or_default();

            inner.values.insert(
                PARENT_SPAN, // "parent_span"
                Cow::from(parent_span.id().into_u64().to_string()),
            );
            inner
        } else {
            Storage::default()
        };

        let mut extensions = span.extensions_mut();
        attrs.record(&mut visitor);
        extensions.insert(visitor);
    }
    ... // elided for brevity
When a new span (via on_new_span) is initiated, for example, a POST request into our application to an endpoint like /songs, our code checks to see if we’re already within a parent span. Otherwise, it will default to a newly created, empty HashMap, which is what Storage::default() wraps under the hood.

For simplicity, we default the map to keys of string references and values of copy-on-write (Cow) smart pointers around string references:

#[derive(Clone, Debug, Default)]
pub(crate) struct Storage<'a> {
    values: HashMap<&'a str, Cow<'a, str>>,
}

Storage persists fields across layers in a registry within the lifecycle of the span due to a span’s extensions, giving us the ability to mutably associate arbitrary data to a span or immutably read from persisted data, including our own data structure.


Many of these lifecycle hooks involve wrestling with extensions, which can be somewhat verbose to work with. The registry is what actually collects and stores span data, which can then bubble through to other layers via the implementation of LookupSpan.

The other code to highlight is attrs.record(&mut visitor), which records field values of the various types by visiting each one. Visit is a trait that must be implemented for this purpose:

// Just a sample of the implemented methods
impl Visit for Storage<'_> {
    /// Visit a signed 64-bit integer value.
    fn record_i64(&mut self, field: &Field, value: i64) {
        self.values
            .insert(field.name(), Cow::from(value.to_string()));
    }
    ... // elided for brevity
    fn record_debug(&mut self, field: &Field, value: &dyn fmt::Debug) {
        // Note: this is invoked via the `debug!` and `info!` macros
        let debug_formatted = format!("{:?}", value);
        self.values.insert(field.name(), Cow::from(debug_formatted));
    }
}

Once we’ve recorded all the values for each type, all of this is stored within the Storage HashMap by the visitor, which will be available for downstream layers to use for lifecycle triggers in the future.

On record

impl<S> Layer<S> for StorageLayer
where
    S: Subscriber + for<'span> LookupSpan<'span>,
{
    ... // elided for brevity
    fn on_record(&self, span: &Id, values: &Record<'_>, ctx: Context<'_, S>) {
        let span = ctx.span(span).expect("Span not found");
        let mut extensions = span.extensions_mut();
        let visitor = extensions
            .get_mut::<Storage>()
            .expect("Visitor not found on 'record'!");
        values.record(visitor);
    }
    ... // elided for brevity

As we continue through each lifecycle trigger, we’ll notice that the pattern is similar. We get a mutable, scoped handle into the span’s storage extension, and we record values as they arrive.

This hook notifies a layer that a span with a given identifier has recorded the given values, via calls like debug_span! or info_span!:

let span = info_span!(
    "vendor.cdbaby.task", // span name (assumed for illustration)
    subject = "vendor.cdbaby",
    category = "vendor"
);

On event

impl<S> Layer<S> for StorageLayer
where
    S: Subscriber + for<'span> LookupSpan<'span>,
{
    ... // elided for brevity
    fn on_event(&self, event: &Event<'_>, ctx: Context<'_, S>) {
        ctx.lookup_current().map(|current_span| {
            let mut extensions = current_span.extensions_mut();
            extensions.get_mut::<Storage>().map(|visitor| {
                if event
                    .fields()
                    .any(|f| ON_EVENT_KEEP_FIELDS.contains(&f.name()))
                {
                    event.record(visitor);
                }
            })
        });
    }
    ... // elided for brevity

For our contextual storage layer, hooking into events, like tracing::error! messages, is generally unnecessary. But, this can become valuable for storing information on event fields we’d like to keep that may be useful in another layer later in the pipeline.

One example is to store an event if it was attributed to an error so that we can keep track of errors in our metrics layer (e.g., ON_EVENT_KEEP_FIELDS is an array of fields tied to error keys).
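A minimal sketch of what such a keep-list could look like; the exact field names here are assumptions for illustration:

```rust
// Hypothetical keep-list: only error-related event fields get
// persisted into contextual storage for the metrics layer to use.
const ON_EVENT_KEEP_FIELDS: [&str; 2] = ["error", "error.message"];

fn main() {
    assert!(ON_EVENT_KEEP_FIELDS.contains(&"error"));
    assert!(!ON_EVENT_KEEP_FIELDS.contains(&"latency_ms"));
}
```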

On enter and on close

impl<S> Layer<S> for StorageLayer
where
    S: Subscriber + for<'span> LookupSpan<'span>,
{
    ... // elided for brevity
    fn on_enter(&self, span: &Id, ctx: Context<'_, S>) {
        let span = ctx.span(span).expect("Span not found");
        let mut extensions = span.extensions_mut();
        if extensions.get_mut::<Instant>().is_none() {
            extensions.insert(Instant::now());
        }
    }

    fn on_close(&self, id: Id, ctx: Context<'_, S>) {
        let span = ctx.span(&id).expect("Span not found");
        let mut extensions = span.extensions_mut();
        let elapsed_milliseconds = extensions
            .get_mut::<Instant>()
            .map(|i| i.elapsed().as_millis())
            .unwrap_or(0);

        let visitor = extensions
            .get_mut::<Storage>()
            .expect("Visitor not found on 'record'");

        visitor.values.insert(
            LATENCY_FIELD, // "latency_ms"
            Cow::from(format!("{}", elapsed_milliseconds)),
        );
    }
    ... // elided for brevity

Spans are inherently tagged time intervals, with a clear-cut beginning and end. For the scope of a span, we want to capture the elapsed time between when the span with a given id was entered (Instant::now) until it is closed for a given operation.

Storing the latency for each span within our extensions enables other layers to derive metrics automatically and benefits exploratory purposes while debugging event logs for a given span id. Below, we can see the opening and closing of a vendor task/process span with id=452612587184455697, which took 18ms from start to finish:

level=INFO span_name=vendor.lastfm.task span=452612587184455697 parent_span=span=452612587184455696 span_event=new_span timestamp=2022-10-31T12:35:36.913335Z trace_id=c53cb20e4ab4fa42aa5836d26e974de2 http.client_ip= subject=vendor.lastfm application.request_path=/songs http.method=POST category=vendor http.host= http.route=/songs request_id=01GGQ0MJ94E24YYZ6FEXFPKVFP
level=INFO span_name=vendor.lastfm.task span=452612587184455697 parent_span=span=452612587184455696 span_event=close_span timestamp=2022-10-31T12:35:36.931975Z trace_id=c53cb20e4ab4fa42aa5836d26e974de2 latency_ms=18 http.client_ip= subject=vendor.lastfm application.request_path=/songs http.method=POST category=vendor http.host= http.route=/songs request_id=01GGQ0MJ94E24YYZ6FEXFPKVFP

Augmenting structured logs with contextual information

Now we will get a glimpse into how our storage data can be tapped into for actual telemetry output by looking at our event log formatting layer:


When it comes to writing custom layers and subscriber implementations, many examples tend toward custom formatters. (N.B., the log examples above use this same format, inspired by an implementation from InfluxDB.)

We recommend using published layers or libraries, or following existing tutorials, when diving into the specifics of generating data in a preferred format.

For this article, which builds up our own custom formatter layer, we’ll re-familiarize ourselves with the span lifecycle, specifically for span and event logs, which will now take advantage of our storage map.

On new span

impl<S, Wr, W> Layer<S> for LogFmtLayer<Wr, W>
where
    Wr: Write + 'static,
    W: for<'writer> MakeWriter<'writer> + 'static,
    S: Subscriber + for<'span> LookupSpan<'span>,
{
    fn on_new_span(&self, _attrs: &Attributes<'_>, id: &Id, ctx: Context<'_, S>) {
        let mut p = self.printer.write();
        let metadata = ctx.metadata(id).expect("Span missing metadata");
        // printer methods format level, span name, span event, and
        // timestamp in logfmt style (method names sketched here)
        p.write_level(metadata.level());
        p.write_span_name(metadata.name());
        p.write_span_event("new_span");
        p.write_timestamp();

        let span = ctx.span(id).expect("Span not found");
        let extensions = span.extensions();
        if let Some(visitor) = extensions.get::<Storage>() {
            for (key, value) in visitor.values() {
                p.write_kv(decorate_field_name(key), value);
            }
        }
    }
    ... // elided for brevity

The above code is using the MakeWriter trait to print formatted text representations of span events. The call to decorate_field_name and all the printer write methods execute specific formatting properties under the hood (again, in this case, logfmt).

Going back to our previous span log example, it’s now more apparent where keys like level, span, and span_name are set. One piece of code to call out here is how we loop, for (key, value), over the values read from our storage map, lifting information we observed and collected in the previous layer.

We use this to provide context to augment our structured log events in another layer. Put another way, we’re composing specific sub-behaviors on trace data via layers in order to build out a singular subscriber to the overall trace. Field keys like http.route and http.host are lifted from this storage layer for example.

On event

impl<S, Wr, W> Layer<S> for LogFmtLayer<Wr, W>
where
    Wr: Write + 'static,
    W: for<'writer> MakeWriter<'writer> + 'static,
    S: Subscriber + for<'span> LookupSpan<'span>,
{
    ... // elided for brevity
    fn on_event(&self, event: &Event<'_>, ctx: Context<'_, S>) {
        let mut p = self.printer.write();
        event.record(&mut *p);
        // record source information (target and file location;
        // method name sketched here)
        p.write_source_info(event.metadata());
        p.write_timestamp();

        ctx.lookup_current().map(|current_span| {
            let extensions = current_span.extensions();
            extensions.get::<Storage>().map(|visitor| {
                for (key, value) in visitor.values() {
                    if !ON_EVENT_SKIP_FIELDS.contains(key) {
                        p.write_kv(decorate_field_name(key), value);
                    }
                }
            })
        });
    }
    ... // elided for brevity

Though somewhat tedious, the pattern for implementing these span lifecycle methods is getting easier and easier to make out. Field key-value pairs like target and location are formatted from source information, giving us the target="application::middleware::logging" and location="src/middleware/logging.rs:354" seen previously. Keys like vendor.request_path and vendor.url are also lifted from contextual storage.

While more work would probably go into implementing any formatting specification correctly, we can now see the granular control and customization that the tracing framework provides. This contextual information is how we’ll eventually be able to form correlations within a request lifecycle.

Deriving metrics through instrumentation and span durations

Metrics, in particular, are actually quite hostile to observability on their own, and the cardinality of a metric, the number of unique combinations of metric names and dimension values, can easily be abused.


We’ve already shown how structured logs can be derived from events. Metrics themselves should be formed from the event, or span, that encompassed them.

We will still need out-of-band metrics like those geared around process collection (e.g., CPU usage, disk bytes written/read). However, if we’re already able to instrument our code at the function level to determine when something occurred, couldn’t some metrics “fall out for free?” As mentioned previously, we have the tooling, but we just need to thread it through.

Tracing provides accessible ways to annotate functions a user wants to instrument, which means creating, entering, and closing a span each time the annotated function executes. The Rust compiler itself makes heavy use of these annotated instrumentations throughout its codebase:

#[instrument(skip(self, op), level = "trace")]
pub(super) fn fully_perform_op<R: fmt::Debug, Op>(
    &mut self,
    locations: Locations,
    category: ConstraintCategory<'tcx>,
    op: Op,
) -> Fallible<R>

For our purposes, let’s look at a straightforward, async database save_event function that’s been instrumented with some very specific field definitions:

#[instrument(
    level = "info",
    name = "record.save_event",
    fields(category="db", subject="aws_db", event_id = %event.event_id,
           event_type=%event.event_type, otel.kind="client", db.system="aws_db",
           metric_name="db_event", metric_label_event_table=%self.event_table_name),
    err(Display)
)]
pub async fn save_event(&self, event: &Event) -> anyhow::Result<()> {
    ... // elided for brevity
}

Our instrumented function carries metric-prefixed fields like metric_name and metric_label_event_table. These keys correspond to metric names and labels typically found in a Prometheus monitoring setup. We’ll come back to these prefixed fields shortly. First, let’s expand the MetricsLayer that we set up initially with some additional filters.

Essentially, these filters do two things: 1) produce metrics for all trace-log-level or higher events (even though they may not be logged to stdout based on a configured log level); and, 2) pass through events for instrumented functions with the record prefix attached, as in name = "record.save_event" above.
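The record-prefix check at the heart of the second filter is plain string logic; a hedged sketch, with the wiring commented out since it depends on the MetricsLayer defined below:

```rust
// Predicate behind the second filter: only instrumented functions
// whose span name carries the "record" prefix produce metrics.
fn is_metrics_span(span_name: &str) -> bool {
    span_name.starts_with("record")
}

// Wiring it up might look like this (assumed API usage):
// MetricsLayer.with_filter(filter::filter_fn(|meta| {
//     meta.is_span() && is_metrics_span(meta.name())
// }))

fn main() {
    assert!(is_metrics_span("record.save_event"));
    assert!(!is_metrics_span("HTTP request"));
}
```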

After this, all that’s left in order to automate metric derivation is to return to our metrics-layer implementation.

On close

const PREFIX_LABEL: &str = "metric_label_";
const METRIC_NAME: &str = "metric_name";
const OK: &str = "ok";
const ERROR: &str = "error";
const LABEL: &str = "label";
const RESULT_LABEL: &str = "result";

impl<S> Layer<S> for MetricsLayer
where
    S: Subscriber + for<'span> LookupSpan<'span>,
{
    fn on_close(&self, id: Id, ctx: Context<'_, S>) {
        let span = ctx.span(&id).expect("Span not found");
        let mut extensions = span.extensions_mut();
        let elapsed_secs_f64 = extensions
            .get_mut::<Instant>()
            .map(|i| i.elapsed().as_secs_f64())
            .unwrap_or(0.0);
        if let Some(visitor) = extensions.get_mut::<Storage>() {
            let mut labels = vec![];
            for (key, value) in visitor.values() {
                if key.starts_with(PREFIX_LABEL) {
                    labels.push((
                        key.strip_prefix(PREFIX_LABEL).unwrap_or(LABEL),
                        value.to_string(),
                    ));
                }
            }
            ... // elided for brevity
            let name = visitor
                .values()
                .get(METRIC_NAME)
                .unwrap_or(&Cow::from(span.name()))
                .to_string();
            if visitor.values().contains_key(ERROR) {
                labels.push((RESULT_LABEL, String::from(ERROR)));
            } else {
                labels.push((RESULT_LABEL, String::from(OK)));
            }
            ... // elided for brevity
            metrics::increment_counter!(format!("{}_total", name), &labels);
            metrics::histogram!(
                format!("{}_duration_seconds", name),
                elapsed_secs_f64,
                &labels
            );
        }
    }
}

There are a lot of bits being pushed around in this example, and some of it is elided. Nonetheless, on_close, we always have access to the end of a span interval via elapsed_secs_f64, which can drive our histogram calculation via the metrics::histogram! macro.

Note that we’re leveraging the metrics-rs project here. Anyone can model this function the same way with another metrics library that provides counter and histogram support. From our storage map, we pull out all metric_*-labeled keys and use those to generate labels for the automatically derived incremented counter and histogram.

Also, if we’ve stored an event that errored out, we’re able to use that as part of the label, distinguishing our resulting functions based on ok/error. Given any instrumented function, we’ll derive metrics from it with this same code behavior.

The output we’d encounter from a Prometheus endpoint would show a counter that looks something like this:

db_event_total{event_table="events",event_type="Song",result="ok",span_name="save_event"} 8

Instrumenting async closures and indirect span relationships

A question that does come up from time to time is how to instrument code with spans that reference an indirect, non-parent-to-child relationship, or what’s called a follows from reference.

This comes into play for async operations that spawn requests to side-effecting downstream services, or for processes that emit data to service buses, where the direct response or returned output has no effect within the operation that spawned it.

For these cases, we can instrument async closures (or futures) directly by entering into a given span (captured as a follows_from reference below) associated with our async future each time it’s polled and exited every time the future is parked, as shown below with .instrument(process_span):

// Start a span around the context process spawn
let process_span = debug_span!(
    parent: None,
    "process.async", // span name (assumed for illustration)
    subject = "songs.async",
    category = "songs"
);
process_span.follows_from(Span::current());

tokio::spawn(
    async move {
        match context.process().await {
            Ok(r) => debug!(song=?r, "successfully processed"),
            Err(e) => warn!(error=?e, "failed processing"),
        }
    }
    .instrument(process_span),
);

OpenTelemetry interoperability for distributed tracing

Much of the usefulness of observability comes from the fact that most services today are actually made up of many microservices. We should all be thinking distributed.

If all kinds of services have to connect to one another, across networks, vendors, clouds, and even edge-oriented or local-first peers, some standards and vendor-agnostic tooling should be enforced. This is where OpenTelemetry (OTel) comes into play, and many of the known observability platforms are more than happy to ingest OTel-compliant telemetry data.

While there’s a whole suite of open source Rust tooling for working within the OTel ecosystem, many of the well-known Rust web frameworks haven’t adopted the incorporation of OTel standards in a baked-in sort of way.

Popular, encompassing web frameworks like Actix and Tokio’s axum rely on custom implementations and external libraries to provide integration (actix-web-opentelemetry and axum-tracing-opentelemetry, respectively). Third-party integration has been the favored choice thus far, and while that promotes flexibility and user control, it can make it more difficult for those expecting to add the integration almost seamlessly.

We won’t cover writing a custom implementation here in detail, but canonical HTTP middleware like Tower allows for overriding the default implementation of span creation on a request. If implemented to spec, these fields should be set on a span’s metadata:

  • http.client_ip: The client’s IP address
  • http.flavor: The protocol version used (HTTP/1.1, HTTP/2.0, etc.)
  • http.host: The value of the Host header
  • http.method: The request method
  • http.route: The matched route
  • http.request_content_length: The request content length
  • http.response_content_length: The response content length
  • http.scheme: The URI scheme used (HTTP or HTTPS)
  • http.status_code: The response status code
  • http.target: The full request target including path and query parameters
  • http.user_agent: The value of the User-Agent header
  • otel.kind: The span kind; typically server for HTTP handlers
  • otel.name: Name consisting of http.method and http.route
  • otel.status_code: OK if the response is successful; ERROR if it is a 5xx
  • trace_id: The identifier for a trace, used to group all spans for a specific trace together across processes
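As a small illustration of how two of the fields above are derived, consider `otel.name` and `otel.status_code`. The helper names below are hypothetical; a real middleware implementation would record these values on the span itself:

```rust
// Hypothetical helpers illustrating two of the OTel span fields above.

// otel.name: composed of http.method and the matched http.route
fn otel_name(method: &str, route: &str) -> String {
    format!("{} {}", method, route)
}

// otel.status_code: per the OTel HTTP semantic conventions, only a 5xx
// response marks a *server* span as an error
fn otel_status_code(http_status: u16) -> &'static str {
    if (500..600).contains(&http_status) {
        "ERROR"
    } else {
        "OK"
    }
}

fn main() {
    assert_eq!(otel_name("GET", "/songs/:id"), "GET /songs/:id");
    assert_eq!(otel_status_code(200), "OK");
    assert_eq!(otel_status_code(503), "ERROR");
}
```

Note that `otel.name` uses the matched route template (e.g., `/songs/:id`), not the concrete request path, so spans for the same handler aggregate together.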

Initializing a tracer

Tracing via tracing-opentelemetry and rust-opentelemetry exposes another layer with which we can compose our subscriber, adding OTel contextual information to all spans and emitting those spans to observability platforms like Datadog or Honeycomb, or directly to a running instance of Jaeger or Tempo, which can sample trace data for manageable consumption.

Initializing a Tracer to produce and manage spans is pretty straightforward:

pub fn init_tracer(settings: &Otel) -> Result<Tracer> {
    // Resource attributes describing this service, attached to every span
    let resource = Resource::new(vec![
        KeyValue::new("service.name", env!("CARGO_PKG_NAME")),
        KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
    ]);

    let api_token = MetadataValue::from_str(&settings.api_token)?;
    let endpoint = &settings.exporter_otlp_endpoint;

    // Attach the API token as gRPC metadata on every export request
    let mut map = MetadataMap::with_capacity(1);
    map.insert("x-tracing-service-header", api_token);

    let trace = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(exporter(map, endpoint)?)
        .with_trace_config(trace::config().with_resource(resource))
        .install_batch(opentelemetry::runtime::Tokio)
        .map_err(|e| anyhow!("failed to initialize tracer: {:#?}", e))?;

    Ok(trace)
}


Including it within our pipeline of layers is straightforward as well. We can also filter based on level and use a dynamic filter to skip events that we’d like to avoid in our traces:

tracing_opentelemetry::layer()
    .with_tracer(tracer)
    .with_filter(LevelFilter::INFO)
    .with_filter(dynamic_filter_fn(|_metadata, ctx| {
        !ctx.lookup_current()
            // Exclude the rustls session "Connection" events
            // which don't have a parent span
            .map(|s| s.parent().is_none() && s.name() == "Connection")
            .unwrap_or_default()
    }))

With this pipeline initialization, all of our application traces can be ingested by a tool like Jaeger, as we demonstrated earlier in this article. Then, all that’s left is data correlation, slicing, and dicing.


By composing these tracing layers together, we can expose system behavior information in a refined and granular way, gaining enough output and enough context to start making sense of that behavior. All this customization still comes at a price: it’s not completely automatic all the way through, but the patterns are idiomatic, and there are open source layers available to drop in for many common use cases.

Above all, this post should make it easier for users to customize how their applications interact with trace collection, and demonstrate how far that can go in preparing our applications to handle the cases that inevitably fall between the cracks. This is just the beginning of our beautiful friendship with our events and when they happen, and thereby with observability. How we go about debugging and solving problems with it in the long run will always be ongoing work.

Special thanks

I’d like to offer my sincere thanks to Paul Cleary, Daniel Jin, Lukas Jorgensen, Jonathan Whittle, Jared Morrow, John Mumm, Sean Cribbs, Scott Fritchie, and Cuyler Jones for all their help, edits, and collaboration on this article.

Another major thank you to my team at fission.codes, especially Brian Ginsburg and Quinn Wilton, for all their backing on this, and to my advisor Frank Pfenning for putting up with me while I did some non-research work!


