
OpenTelemetry: Designing and Implementing an Observability Stack

Learn to design and implement an observability stack

Written by Jamie Gaynor

Designing an observability stack

There isn’t a single “right” design. Different teams will make different choices based on scale, security requirements, cost constraints, and what they already run in production.

For this example, the observability design is intentionally simple. The goal is to show how Tines emits traces, how those traces move through the pipeline, and how they turn into something you can actually use in Grafana.

At a high level, the stack is built around five core components:

  • Tines: Generates OpenTelemetry traces

  • OpenTelemetry Collector: Receives traces from Tines, does light processing and sampling, then forwards them

  • Tempo: Stores full traces and lets you search and inspect them

  • Prometheus: Stores metrics derived from those traces, including RED and service-graph metrics

  • Grafana: The user interface layer that queries Tempo and Prometheus and powers the dashboards

Tines sits at the edge, emitting telemetry. Everything else exists to receive it, shape it, store it, and present it in a way that helps you answer questions about performance problems.

At a high level, data moves through our example system as follows:

  • Tines exports OTEL traces

  • the OTEL Collector receives and processes them

  • Tempo stores and indexes traces for search and TraceQL

  • Prometheus stores span-derived metrics

  • Grafana queries both and ties traces and metrics together

You can find the configurations for the observability stack and Grafana Dashboard in the .zip file below:

Note on configurations provided in this article.

The configuration examples included in this article are intended to illustrate how an example OpenTelemetry stack can ingest Tines OTEL traces. They are not the only valid deployment pattern and should not be treated as production-ready configurations without thorough self-review. Different environments will use different storage backends, security controls, sampling strategies, and network topologies.

Implementing an observability stack

Tines needs to be configured to export OTEL traces to our stack.

At a minimum you need to:

  • enable tracing and auto instrumentation

  • choose the OTLP protocol (HTTP or gRPC)

  • point Tines at your collector endpoint

  • optionally set a service name

This means setting container environment variables for tines-app and tines-sidekiq, as in the examples below:

environment:
  # Enable OTEL
  - OTEL_ENABLED=true
  # Configure endpoint and protocol
  - OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-host.example.com:4318
  - OTEL_SERVICE_NAME=tines-app
  # Enable auto-instrumentation for detailed traces
  - OTEL_AUTO_INSTRUMENTATION=true

environment:
  # Enable OTEL
  - OTEL_ENABLED=true
  # Configure endpoint and protocol
  - OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-host.example.com:4318
  - OTEL_SERVICE_NAME=tines-sidekiq
  # Enable auto-instrumentation for detailed traces
  - OTEL_AUTO_INSTRUMENTATION=true

Docker Compose: Deploying an observability stack

To keep everything self-contained, we’re running the observability stack with Docker in a single docker-compose.yml. Only the collector’s OTLP ports and the Grafana UI need to be reachable from outside; Tempo and Prometheus stay internal to the Docker network, and data moves through the system exactly as described above.

Below is the directory structure used for this example:

tines-observability/
├─ docker-compose.yml
├─ .env                                    # .env file
├─ grafana/
│  └─ provisioning/
│     ├─ datasources/
│     │  ├─ prometheus.yml                 # Grafana Prometheus datasource
│     │  └─ tempo.yml                      # Grafana Tempo datasource
│     └─ dashboards/
│        └─ story-troubleshooting.json     # Grafana dashboard JSON(s)
├─ tempo/
│  └─ tempo.yaml                           # Tempo config
├─ prometheus/
│  └─ prometheus.yml                       # Prometheus scrape config
├─ otel-collector/
│  └─ config.yaml                          # OTEL Collector config
└─ data/                                   # Bound as volumes
   ├─ grafana/                             # /var/lib/grafana
   ├─ tempo/                               # /var/tempo (trace blocks)
   └─ prometheus/                          # /prometheus (TSDB)
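To ground the layout above, here is a sketch of what the docker-compose.yml could look like. Image tags, the contrib collector image, and the exact published ports are assumptions, not a production-ready definition:

```yaml
services:
  otel-collector:
    # The contrib distribution is needed for tail_sampling and the spanmetrics connector
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector/config.yaml:/etc/otelcol/config.yaml:ro
    ports:
      - "4317:4317" # OTLP gRPC from Tines
      - "4318:4318" # OTLP HTTP from Tines

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yaml:/etc/tempo/tempo.yaml:ro
      - ./data/tempo:/var/tempo
    # No published ports: only the collector and Grafana reach Tempo internally

  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      # Accept remote_write from Tempo's metrics generator
      - --web.enable-remote-write-receiver
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./data/prometheus:/prometheus

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./data/grafana:/var/lib/grafana
    ports:
      - "3000:3000" # Grafana UI
```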

OTEL collector: Configuration, filtering, & exporting traces

The OpenTelemetry collector is the center of the pipeline. Tines sends traces here, the collector decides what to keep, generates metrics from those traces, and then forwards everything to the right backend.

Receivers: Where Tines sends the data

The collector exposes both OTLP endpoints:

  • gRPC on 4317

  • HTTP on 4318

You can choose either protocol in Tines, so both are enabled:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317 # gRPC receiver
      http:
        endpoint: 0.0.0.0:4318 # HTTP receiver

Filtering: Remove noise before sampling

Before sampling or exporting, the config drops spans you will almost never care about. In our example we’re dropping the /health endpoint, but you can filter more aggressively as needed:

filter:
  traces:
    span:
      - 'attributes["http.target"] == "/health"'

Note: Tines has additional recommendations for head-based sampling when auto-instrumentation is enabled, as it can be extremely noisy. An efficient and useful production sampling configuration likely combines both head-based and tail-based sampling.

# Tail-based sampling - decides AFTER seeing complete traces
tail_sampling:
  decision_wait: 20s              # Wait time for spans before sampling decision
  num_traces: 20000               # Max in-flight traces considered for sampling
  expected_new_traces_per_sec: 75 # Used to size internal reservoir/buffers

  policies: # A trace is kept if ANY policy decides to sample it
    - name: errors
      type: status_code
      status_code:
        status_codes: [ERROR] # Keep 100% of errors

    - name: slow-traces
      type: latency
      latency:
        threshold_ms: 5000 # Keep 100% of traces >5s

    - name: fast-traces
      type: probabilistic
      probabilistic:
        sampling_percentage: 10 # Sample 10% of remaining "normal" traces

Batch & memory protections

Two processors are here largely for operational safety:

  • batch: send spans in groups instead of individually

  • memory_limiter: keep the collector from running out of memory

# Batch - groups data before export to reduce network overhead
batch:
  timeout: 10s              # Max time to wait before flushing a batch
  send_batch_size: 1024     # Preferred batch size (items)
  send_batch_max_size: 2048 # Hard cap on batch size

# Memory limiter - prevents OOM by applying backpressure
memory_limiter:
  check_interval: 1s
  limit_mib: 1536 # Soft cap for collector memory inside a 2 GiB container

Spanmetrics connector: Turning Traces into Metrics

The spanmetrics connector converts traces into Prometheus-queryable metrics. It does three main things:

  1. It builds latency histograms using explicit buckets, which lets us calculate percentile latency (like P95/P99) and quickly spot slow or timing-out requests in the dashboard.

  2. It enables exemplars, so individual metric points can link directly back to the exact trace in Tempo that caused them, making “jump to trace from spike” possible.

  3. It adds selected dimensions (HTTP method, status code, destination host, exception type, story and action identifiers) so we can break metrics down by specific actions, destinations, or failure types without exploding cardinality.

connectors:
  spanmetrics:
    histogram:
      explicit:
        # Extended buckets for better P95/P99 accuracy and timeout detection
        buckets: [100ms, 250ms, 500ms, 1s, 2s, 5s, 7s, 10s, 12s, 15s, 20s, 30s, 45s, 60s]

    exemplars:
      enabled: true # Link metric samples back to example traces in Tempo

    metrics_flush_interval: 15s

    dimensions:
      # HTTP attributes
      - name: http.method
      - name: http.status_code

      # Network attributes
      - name: net.peer.name # Destination hostname

      # Error attributes
      - name: exception.type # SSL / timeout / connection error groupings

      # Tines-specific attributes
      - name: action.type # Agent type (HTTPRequestAgent, etc.)
      - name: action.id # Specific action within story
      - name: story_container.id # Container/folder ID

      # Database attributes
      - name: db.system
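Once these histograms land in Prometheus, percentile latency becomes a single query. As a sketch (exact metric and label names depend on your spanmetrics and collector versions; this assumes the duration histogram is exported in milliseconds under the `tines` namespace and the `action.type` dimension is sanitized to the label `action_type`), P95 latency per action type might look like:

```promql
histogram_quantile(
  0.95,
  sum by (le, action_type) (
    rate(tines_traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)
```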

Exporters: Where data actually goes

Traces → Tempo

# Send traces to Tempo for storage and querying
otlp/tempo:
  endpoint: tempo:4317
  compression: gzip # Compress traces before sending
  sending_queue:
    enabled: true
    num_consumers: 4  # Parallel senders
    queue_size: 10000 # Buffer during Tempo hiccups
  tls:
    insecure: true # No TLS for internal Docker network
  timeout: 30s
  retry_on_failure:
    enabled: true
    initial_interval: 5s
    max_interval: 30s
    max_elapsed_time: 300s # Give up after 5 minutes

Metrics → Prometheus

# Expose metrics for Prometheus to scrape at :8889/metrics
prometheus:
  endpoint: "0.0.0.0:8889"
  namespace: tines # Prefix: tines_traces_span_metrics_*
  const_labels:
    environment: production
  resource_to_telemetry_conversion:
    enabled: true # Promote resource attrs (service.name, etc.) to metric labels

Service pipelines: wiring it all together

At the bottom of the config, everything is connected:

service:
  telemetry:
    logs:
      level: info
    metrics:
      level: basic # Enable otelcol_* self-metrics

  pipelines:
    # Traces: OTLP → memory_limiter → filter → tail_sampling → batch → Tempo + spanmetrics
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, tail_sampling, batch]
      exporters: [otlp/tempo, spanmetrics]

    # Metrics: spanmetrics → batch → Prometheus exporter
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [prometheus]

Tempo: Trace storage

Tempo is the trace backend in this stack. The OpenTelemetry Collector sends sampled traces to Tempo over OTLP, and Tempo is responsible for storing them and making them queryable from Grafana.

This configuration does a few important things:

  • Tempo is configured to listen on:

    • gRPC on 4317

    • HTTP on 4318

      • These ports are internal to Docker. Tines never talks to Tempo directly — traces always go Tines → OTEL Collector → Tempo.

  • It uses local storage with WAL + block compaction

    • a write-ahead log (WAL) at /var/tempo/wal

      • WAL is the short-term buffer while traces are being ingested

    • immutable compressed blocks at /var/tempo/blocks

      • blocks are the long-term persisted storage

      • the compactor periodically merges and deletes old blocks

      • Retention is explicitly configured: block_retention: 168h → 7 days of traces

  • It runs the metrics generator

    • This config enables Tempo’s built-in metrics generator. The generator consumes traces and produces:

      • service graph metrics

      • exemplars tied back to Tempo traces

    • Those metrics are sent to Prometheus via remote_write: http://prometheus:9090/api/v1/write with send_exemplars: true
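Putting those pieces together, a tempo.yaml along these lines would match the behavior described above. Field names vary between Tempo versions (the metrics-generator processors in particular are enabled through overrides), so treat this as a sketch rather than a drop-in config:

```yaml
server:
  http_listen_port: 3200 # Query API used by Grafana

distributor:
  receivers: # Internal-only OTLP endpoints; only the collector sends here
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal    # Short-term ingest buffer
    local:
      path: /var/tempo/blocks # Long-term compressed blocks

compactor:
  compaction:
    block_retention: 168h # 7 days of traces

metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
```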

Prometheus: metrics from traces

In this stack, Prometheus stores metrics that are generated from traces:

  • Tines sends traces to the OTEL Collector

  • the Collector turns those traces into metrics via spanmetrics

  • Prometheus scrapes those metrics and stores them
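A minimal prometheus.yml for this flow might look like the sketch below. The `otel-collector` target hostname is an assumption based on the Docker Compose service name:

```yaml
global:
  scrape_interval: 15s # Matches the Grafana datasource sample interval

scrape_configs:
  # Span-derived RED metrics exposed by the collector's prometheus exporter
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]
```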

Grafana: datasources and dashboards

Grafana is the UI layer in this stack. It does three things for us:

  • connects to Tempo for traces

  • connects to Prometheus for metrics

  • optionally loads prebuilt dashboards automatically

Instead of clicking everything together in the UI, we deploy Grafana declaratively so the environment is repeatable.

Datasource provisioning

Tempo datasource

  • type: tempo

  • used for:

    • TraceQL search

    • viewing full traces

    • jumping from metrics → trace

It’s also wired so Tempo can pull metrics context from Prometheus for service maps and node graphs.
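A tempo.yml provisioning file matching that description might look like the following sketch. The `uid` values and the Prometheus wiring under `jsonData` are assumptions that must line up with your other datasource definitions:

```yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      serviceMap:
        datasourceUid: prometheus # Pull service-graph metrics for the node graph
      tracesToMetrics:
        datasourceUid: prometheus # Jump from a trace to related metrics
```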

Prometheus datasource

  • type: prometheus

  • marked as default

  • sample interval set to 15s to match scrape interval

  • exemplarTraceIdDestinations tells Grafana:

    • exemplars in Prometheus metrics contain a trace ID

    • clicking them should open Tempo
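The corresponding prometheus.yml datasource could be sketched as follows. The exemplar label name `trace_id` and the `tempo` uid are assumptions that must match the spanmetrics output and the Tempo datasource respectively:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s # Match the scrape interval
      exemplarTraceIdDestinations:
        - name: trace_id       # Exemplar label holding the trace ID
          datasourceUid: tempo # Clicking an exemplar opens Tempo
```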

Dashboard provisioning

The dashboard provider configuration tells Grafana to:

  • look for JSON dashboards in

    /etc/grafana/provisioning/dashboards/json

  • put them in a folder called Tines

  • automatically reload if files change
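The steps above translate into a dashboard provider file along these lines; the provider name and reload interval are illustrative choices:

```yaml
apiVersion: 1

providers:
  - name: tines-dashboards
    folder: Tines # Grafana folder the dashboards appear in
    type: file
    updateIntervalSeconds: 30 # Re-scan the path so file changes are picked up
    allowUiUpdates: false
    options:
      path: /etc/grafana/provisioning/dashboards/json
```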

