
OpenTelemetry: Designing and Implementing an Observability Stack

Learn to design and implement an observability stack

Written by Jamie Gaynor

Designing an observability stack

There isn’t a single “right” design. Different teams will make different choices based on scale, security requirements, cost constraints, and what they already run in production.

For this example, the observability design is intentionally simple. The goal is to show how Tines emits traces, how those traces move through the pipeline, and how they turn into something you can actually use in Grafana.

At a high level, the stack is built around five core components:

  • Tines: Generates OpenTelemetry traces

  • OpenTelemetry Collector: Receives traces from Tines, does light processing and sampling, then forwards them

  • Tempo: Stores full traces and lets you search and inspect them

  • Prometheus: Stores metrics derived from those traces, including RED and service-graph metrics

  • Grafana: The user interface layer that queries Tempo and Prometheus and powers the dashboards

Tines sits at the edge, emitting telemetry. Everything else exists to receive it, shape it, store it, and present it in a way that helps you answer questions about performance problems.

At a high level, data moves through our example system as follows:

  • Tines exports OTEL traces

  • the OTEL Collector receives and processes them

  • Tempo stores and indexes traces for search and TraceQL

  • Prometheus stores span-derived metrics

  • Grafana queries both and ties traces and metrics together

You can find the configurations for the observability stack and Grafana Dashboard in the .zip file below:

Note on configurations provided in this article.

The configuration examples included in this article are intended to illustrate how an example OpenTelemetry stack can ingest Tines OTEL traces. They are not the only valid deployment pattern and should not be treated as production-ready configurations without thorough self-review. Different environments will use different storage backends, security controls, sampling strategies, and network topologies.

Implementing an observability stack

Tines needs to be configured to export OTEL traces to our stack.

At a minimum you need to:

  • enable tracing and auto instrumentation

  • choose the OTLP protocol (HTTP or gRPC)

  • point Tines at your collector endpoint

  • optionally set a service name

This means setting container environment variables for tines-app and tines-sidekiq, as in the examples below:

environment:
  # Enable OTEL
  - OTEL_ENABLED=true
  # Configure endpoint and protocol
  - OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-host.example.com:4318
  - OTEL_SERVICE_NAME=tines-app
  # Enable auto-instrumentation for detailed traces
  - OTEL_AUTO_INSTRUMENTATION=true

environment:
  # Enable OTEL
  - OTEL_ENABLED=true
  # Configure endpoint and protocol
  - OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-host.example.com:4318
  - OTEL_SERVICE_NAME=tines-sidekiq
  # Enable auto-instrumentation for detailed traces
  - OTEL_AUTO_INSTRUMENTATION=true

Docker Compose: Deploying an observability stack

To keep everything self-contained, we’re running the observability stack with Docker in a single docker-compose.yml. Only the collector’s OTLP ports and the Grafana UI need to be reachable from outside; Tempo and Prometheus stay internal to the Docker network, and data moves through the system exactly as described above.

Below is the directory structure used for this example:

tines-observability/
├─ docker-compose.yml
├─ .env                                    # .env file
├─ grafana/
│  └─ provisioning/
│     ├─ datasources/
│     │  ├─ prometheus.yml                 # Grafana Prometheus datasource
│     │  └─ tempo.yml                      # Grafana Tempo datasource
│     └─ dashboards/
│        └─ story-troubleshooting.json     # Grafana dashboard JSON(s)
├─ tempo/
│  └─ tempo.yaml                           # Tempo config
├─ prometheus/
│  └─ prometheus.yml                       # Prometheus scrape config
├─ otel-collector/
│  └─ config.yaml                          # OTEL Collector config
└─ data/                                   # Bound as volumes
   ├─ grafana/                             # /var/lib/grafana
   ├─ tempo/                               # /var/tempo (trace blocks)
   └─ prometheus/                          # /prometheus (TSDB)
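To ground the layout above, here is a sketch of what the docker-compose.yml could look like. Image tags, the contrib collector image, and the exact published ports are assumptions, not a production-ready definition:

```yaml
services:
  otel-collector:
    # The contrib distribution is needed for tail_sampling and the spanmetrics connector
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector/config.yaml:/etc/otelcol/config.yaml:ro
    ports:
      - "4317:4317" # OTLP gRPC from Tines
      - "4318:4318" # OTLP HTTP from Tines

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yaml:/etc/tempo/tempo.yaml:ro
      - ./data/tempo:/var/tempo
    # No published ports: only the collector and Grafana reach Tempo internally

  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      # Accept remote_write from Tempo's metrics generator
      - --web.enable-remote-write-receiver
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./data/prometheus:/prometheus

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./data/grafana:/var/lib/grafana
    ports:
      - "3000:3000" # Grafana UI
```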

OTEL collector: Configuration, filtering, & exporting traces

The OpenTelemetry collector is the center of the pipeline. Tines sends traces here, the collector decides what to keep, generates metrics from those traces, and then forwards everything to the right backend.

Receivers: Where Tines sends the data

The collector exposes both OTLP endpoints:

  • gRPC on 4317

  • HTTP on 4318

You can choose either protocol in Tines, so both are enabled:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317 # gRPC receiver
      http:
        endpoint: 0.0.0.0:4318 # HTTP receiver

Filtering: Remove noise before sampling

Before sampling or exporting, the config drops spans you will almost never care about. In our example we’re dropping the /health endpoint, but you can filter more aggressively as needed:

filter:
  traces:
    span:
      - 'attributes["http.target"] == "/health"'

Note: Tines has additional recommendations for head-based sampling when auto-instrumentation is enabled, as it can be extremely noisy. An efficient and useful production sampling configuration likely combines both head-based and tail-based sampling.

# Tail-based sampling - decides AFTER seeing complete traces
tail_sampling:
  decision_wait: 20s              # Wait time for spans before sampling decision
  num_traces: 20000               # Max in-flight traces considered for sampling
  expected_new_traces_per_sec: 75 # Used to size internal reservoir/buffers

  policies: # A trace is kept if ANY policy decides to sample it
    - name: errors
      type: status_code
      status_code:
        status_codes: [ERROR] # Keep 100% of errors

    - name: slow-traces
      type: latency
      latency:
        threshold_ms: 5000 # Keep 100% of traces >5s

    - name: fast-traces
      type: probabilistic
      probabilistic:
        sampling_percentage: 10 # Sample 10% of remaining "normal" traces

Batch & memory protections

Two processors are here largely for operational safety:

  • batch: send spans in groups instead of individually

  • memory_limiter: keep the collector from running out of memory

# Batch - groups data before export to reduce network overhead
batch:
  timeout: 10s              # Max time to wait before flushing a batch
  send_batch_size: 1024     # Preferred batch size (items)
  send_batch_max_size: 2048 # Hard cap on batch size

# Memory limiter - prevents OOM by applying backpressure
memory_limiter:
  check_interval: 1s
  limit_mib: 1536 # Soft cap for collector memory inside a 2 GiB container

Spanmetrics connector: Turning Traces into Metrics

The spanmetrics connector converts traces into Prometheus-queryable metrics. It does three main things:

  1. It builds latency histograms using explicit buckets, which lets us calculate percentile latency (like P95/P99) and quickly spot slow or timing-out requests in the dashboard.

  2. It enables exemplars, so individual metric points can link directly back to the exact trace in Tempo that caused them, making “jump to trace from spike” possible.

  3. It adds selected dimensions (HTTP method, status code, destination host, exception type, story and action identifiers) so we can break metrics down by specific actions, destinations, or failure types without exploding cardinality.

connectors:
  spanmetrics:
    histogram:
      explicit:
        # Extended buckets for better P95/P99 accuracy and timeout detection
        buckets: [100ms, 250ms, 500ms, 1s, 2s, 5s, 7s, 10s, 12s, 15s, 20s, 30s, 45s, 60s]

    exemplars:
      enabled: true # Link metric samples back to example traces in Tempo

    metrics_flush_interval: 15s

    dimensions:
      # HTTP attributes
      - name: http.method
      - name: http.status_code

      # Network attributes
      - name: net.peer.name # Destination hostname

      # Error attributes
      - name: exception.type # SSL / timeout / connection error groupings

      # Tines-specific attributes
      - name: action.type # Agent type (HTTPRequestAgent, etc.)
      - name: action.id # Specific action within story
      - name: story_container.id # Container/folder ID

      # Database attributes
      - name: db.system
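Once these histograms land in Prometheus, percentile latency becomes a single query. As a sketch (exact metric and label names depend on your spanmetrics and collector versions; this assumes the duration histogram is exported in milliseconds under the `tines` namespace and the `action.type` dimension is sanitized to the label `action_type`), P95 latency per action type might look like:

```promql
histogram_quantile(
  0.95,
  sum by (le, action_type) (
    rate(tines_traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)
```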

Exporters: Where data actually goes

Traces → Tempo

# Send traces to Tempo for storage and querying
otlp/tempo:
  endpoint: tempo:4317
  compression: gzip # Compress traces before sending
  sending_queue:
    enabled: true
    num_consumers: 4  # Parallel senders
    queue_size: 10000 # Buffer during Tempo hiccups
  tls:
    insecure: true # No TLS for internal Docker network
  timeout: 30s
  retry_on_failure:
    enabled: true
    initial_interval: 5s
    max_interval: 30s
    max_elapsed_time: 300s # Give up after 5 minutes

Metrics → Prometheus

# Expose metrics for Prometheus to scrape at :8889/metrics
prometheus:
  endpoint: "0.0.0.0:8889"
  namespace: tines # Prefix: tines_traces_span_metrics_*
  const_labels:
    environment: production
  resource_to_telemetry_conversion:
    enabled: true # Promote resource attrs (service.name, etc.) to metric labels

Service pipelines: wiring it all together

At the bottom of the config, everything is connected:

service:
  telemetry:
    logs:
      level: info
    metrics:
      level: basic # Enable otelcol_* self-metrics

  pipelines:
    # Traces: OTLP → memory_limiter → filter → tail_sampling → batch → Tempo + spanmetrics
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, tail_sampling, batch]
      exporters: [otlp/tempo, spanmetrics]

    # Metrics: spanmetrics → batch → Prometheus exporter
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [prometheus]

Tempo: Trace storage

Tempo is the trace backend in this stack. The OpenTelemetry Collector sends sampled traces to Tempo over OTLP, and Tempo is responsible for storing them and making them queryable from Grafana.

This configuration does a few important things:

  • Tempo is configured to listen on:

    • gRPC on 4317

    • HTTP on 4318

      • These ports are internal to Docker. Tines never talks to Tempo directly — traces always go Tines → OTEL Collector → Tempo.

  • It uses local storage with WAL + block compaction

    • a write-ahead log (WAL) at /var/tempo/wal

      • WAL is the short-term buffer while traces are being ingested

    • immutable compressed blocks at /var/tempo/blocks

      • blocks are the long-term persisted storage

      • the compactor periodically merges and deletes old blocks

      • Retention is explicitly configured: block_retention: 168h → 7 days of traces

  • It runs the metrics generator

    • This config enables Tempo’s built-in metrics generator. The generator consumes traces and produces:

      • service graph metrics

      • exemplars tied back to Tempo traces

    • Those metrics are sent to Prometheus via remote_write: http://prometheus:9090/api/v1/write with send_exemplars: true
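Putting those pieces together, a tempo.yaml along these lines would match the behavior described above. Field names vary between Tempo versions (the metrics-generator processors in particular are enabled through overrides), so treat this as a sketch rather than a drop-in config:

```yaml
server:
  http_listen_port: 3200 # Query API used by Grafana

distributor:
  receivers: # Internal-only OTLP endpoints; only the collector sends here
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal    # Short-term ingest buffer
    local:
      path: /var/tempo/blocks # Long-term compressed blocks

compactor:
  compaction:
    block_retention: 168h # 7 days of traces

metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
```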

Prometheus: metrics from traces

In this stack, Prometheus stores metrics that are generated from traces:

  • Tines sends traces to the OTEL Collector

  • the Collector turns those traces into metrics via spanmetrics

  • Prometheus scrapes those metrics and stores them
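A minimal prometheus.yml for this flow might look like the sketch below. The `otel-collector` target hostname is an assumption based on the Docker Compose service name:

```yaml
global:
  scrape_interval: 15s # Matches the Grafana datasource sample interval

scrape_configs:
  # Span-derived RED metrics exposed by the collector's prometheus exporter
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]
```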

Grafana: datasources and dashboards

Grafana is the UI layer in this stack. It does three things for us:

  • connects to Tempo for traces

  • connects to Prometheus for metrics

  • optionally loads prebuilt dashboards automatically

Instead of clicking everything together in the UI, we deploy Grafana declaratively so the environment is repeatable.

Datasource provisioning

Tempo datasource

  • type: tempo

  • used for:

    • TraceQL search

    • viewing full traces

    • jumping from metrics → trace

It’s also wired so Tempo can pull metrics context from Prometheus for service maps and node graphs.
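A tempo.yml provisioning file matching that description might look like the following sketch. The `uid` values and the Prometheus wiring under `jsonData` are assumptions that must line up with your other datasource definitions:

```yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      serviceMap:
        datasourceUid: prometheus # Pull service-graph metrics for the node graph
      tracesToMetrics:
        datasourceUid: prometheus # Jump from a trace to related metrics
```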

Prometheus datasource

  • type: prometheus

  • marked as default

  • sample interval set to 15s to match scrape interval

  • exemplarTraceIdDestinations tells Grafana:

    • exemplars in Prometheus metrics contain a trace ID

    • clicking them should open Tempo
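The corresponding prometheus.yml datasource could be sketched as follows. The exemplar label name `trace_id` and the `tempo` uid are assumptions that must match the spanmetrics output and the Tempo datasource respectively:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s # Match the scrape interval
      exemplarTraceIdDestinations:
        - name: trace_id       # Exemplar label holding the trace ID
          datasourceUid: tempo # Clicking an exemplar opens Tempo
```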

Dashboard provisioning

The dashboard provider configuration tells Grafana to:

  • look for JSON dashboards in

    /etc/grafana/provisioning/dashboards/json

  • put them in a folder called Tines

  • automatically reload if files change
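The steps above translate into a dashboard provider file along these lines; the provider name and reload interval are illustrative choices:

```yaml
apiVersion: 1

providers:
  - name: tines-dashboards
    folder: Tines # Grafana folder the dashboards appear in
    type: file
    updateIntervalSeconds: 30 # Re-scan the path so file changes are picked up
    allowUiUpdates: false
    options:
      path: /etc/grafana/provisioning/dashboards/json
```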

