Designing an observability stack
There isn’t a single “right” design. Different teams will make different choices based on scale, security requirements, cost constraints, and what they already run in production.
For this example, the observability design is intentionally simple. The goal is to show how Tines emits traces, how those traces move through the pipeline, and how they turn into something you can actually use in Grafana.
At a high level, the stack is built around five core components:
Tines: Generates OpenTelemetry traces
OpenTelemetry Collector: Receives traces from Tines, does light processing and sampling, then forwards them
Tempo: Stores full traces and lets you search and inspect them
Prometheus: Stores metrics derived from those traces, including RED and service-graph metrics
Grafana: The user interface layer that queries Tempo and Prometheus and powers the dashboards
Tines sits at the edge, emitting telemetry. Everything else exists to receive it, shape it, store it, and present it in a way that helps you answer questions about performance problems.
This diagram shows the high-level movement of data through our example system:
Tines exports OTEL traces
the OTEL Collector receives and processes them
Tempo stores and indexes traces for search and TraceQL
Prometheus stores span-derived metrics
Grafana queries both and ties traces and metrics together
You can find the configurations for the observability stack and Grafana Dashboard in the .zip file below:
Note on configurations provided in this article.
The configuration examples included in this article are intended to illustrate how an example OpenTelemetry stack can ingest Tines OTEL traces. They are not the only valid deployment pattern and should not be treated as production-ready configurations without thorough self-review. Different environments will use different storage backends, security controls, sampling strategies, and network topologies.
Implementing an observability stack
Tines needs to be configured to export OTEL traces to our stack.
At a minimum you need to:
enable tracing and auto instrumentation
choose the OTLP protocol (HTTP or gRPC)
point Tines at your collector endpoint
optionally set a service name
This means setting container environment variables for tines-app and tines-sidekiq, as in the examples below:
environment:
  # Enable OTEL
  - OTEL_ENABLED=true
  # Configure endpoint and protocol
  - OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-host.example.com:4318
  - OTEL_SERVICE_NAME=tines-app
  # Enable auto-instrumentation for detailed traces
  - OTEL_AUTO_INSTRUMENTATION=true

environment:
  # Enable OTEL
  - OTEL_ENABLED=true
  # Configure endpoint and protocol
  - OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-host.example.com:4318
  - OTEL_SERVICE_NAME=tines-sidekiq
  # Enable auto-instrumentation for detailed traces
  - OTEL_AUTO_INSTRUMENTATION=true
Docker Compose: Deploying an observability stack
To keep everything self-contained, we’re running the observability stack with Docker in a single docker-compose.yml. The following diagram shows how all of the components in the stack fit together, including what is exposed, what stays internal, and how data moves through the system.
Below is the directory structure used for this example:
tines-observability/
├─ docker-compose.yml
├─ .env # .env file
├─ grafana/
│ └─ provisioning/
│ ├─ datasources/
│ │ ├─ prometheus.yml # Grafana Prometheus Datasource
│ │ └─ tempo.yml # Grafana Tempo Datasource
│ └─ dashboards/
│ └─ story-troubleshooting.json # Grafana dashboard json(s)
├─ tempo/
│ └─ tempo.yaml # Tempo config
├─ prometheus/
│ └─ prometheus.yml # Prometheus scrape config
├─ otel-collector/
│ └─ config.yaml # OTEL Collector config
└─ data/ # Bound as volumes
├─ grafana/ # /var/lib/grafana
├─ tempo/ # /var/tempo (trace blocks)
└─ prometheus/ # /prometheus (TSDB)
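A docker-compose.yml wiring those directories together might look like the following sketch. Image tags, port mappings, and command-line flags here are illustrative assumptions, not the exact file from the download, and should be adapted to your environment:

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest  # contrib build includes tail_sampling & spanmetrics
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector/config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC, exposed to Tines
      - "4318:4318"   # OTLP HTTP, exposed to Tines

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yaml:/etc/tempo.yaml
      - ./data/tempo:/var/tempo
    # No published ports: Tempo stays on the internal Docker network

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./data/prometheus:/prometheus

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./data/grafana:/var/lib/grafana
    ports:
      - "3000:3000"   # Grafana UI
```

Only the collector's OTLP ports and the Grafana UI are published; Tempo and Prometheus remain internal, matching the data flow described above.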
OTEL collector: Configuration, filtering, & exporting traces
The OpenTelemetry collector is the center of the pipeline. Tines sends traces here, the collector decides what to keep, generates metrics from those traces, and then forwards everything to the right backend.
Receivers: Where Tines sends the data
The collector exposes both OTLP endpoints:
gRPC on 4317
HTTP on 4318
You can choose either protocol in Tines, so both are enabled:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317 # GRPC Receiver
      http:
        endpoint: 0.0.0.0:4318 # HTTP Receiver
Filtering: Remove noise before sampling
Before sampling or exporting, the config drops spans you will almost never care about. In our example we’re dropping the /health endpoint, but you can filter more aggressively as needed:
filter:
  traces:
    span:
      - 'attributes["http.target"] == "/health"'
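As a sketch of more aggressive filtering, the same processor can drop additional endpoints or whole path patterns using OTTL conditions. The extra paths below are hypothetical examples, not part of the provided configuration:

```yaml
filter:
  traces:
    span:
      - 'attributes["http.target"] == "/health"'
      # Hypothetical: drop metrics-scrape spans as well
      - 'attributes["http.target"] == "/metrics"'
      # Hypothetical: drop anything under an internal path prefix
      - 'IsMatch(attributes["http.target"], "/internal/.*")'
```

Each condition that evaluates to true causes the matching span to be dropped before sampling and export.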
Note: Tines has additional recommendations for head-based sampling when auto-instrumentation is enabled, as it can be extremely noisy. An efficient and useful production sampling configuration likely employs a combination of head-based and tail-based sampling.
# Tail-based sampling - decides AFTER seeing complete traces
tail_sampling:
  decision_wait: 20s # Wait time for spans before sampling decision
  num_traces: 20000 # Max in-flight traces considered for sampling
  expected_new_traces_per_sec: 75 # Used to size internal reservoir/buffers
  policies: # Evaluated in order, first match wins
    - name: errors
      type: status_code
      status_code:
        status_codes: [ERROR] # Keep 100% of errors
    - name: slow-traces
      type: latency
      latency:
        threshold_ms: 5000 # Keep 100% of traces >5s
    - name: fast-traces
      type: probabilistic
      probabilistic:
        sampling_percentage: 10 # Sample 10% of remaining "normal" traces
Batch & memory protections
Two processors are here largely for operational safety:
batch: send spans in groups instead of individually
memory_limiter: keep the collector from running out of memory
# Batch - groups data before export to reduce network overhead
batch:
  timeout: 10s # Max time to wait before flushing a batch
  send_batch_size: 1024 # Preferred batch size (items)
  send_batch_max_size: 2048 # Hard cap on batch size

# Memory limiter - prevents OOM by applying backpressure
memory_limiter:
  check_interval: 1s
  limit_mib: 1536 # Soft cap for collector memory inside 2GiB container
Spanmetrics connector: Turning Traces into Metrics
The spanmetrics connector converts traces into Prometheus-queryable metrics. It does three main things:
It builds latency histograms using explicit buckets, which lets us calculate percentile latency (like P95/P99) and quickly spot slow or timing-out requests in the dashboard.
It enables exemplars, so individual metric points can link directly back to the exact trace in Tempo that caused them, making “jump to trace from spike” possible.
It adds selected dimensions (HTTP method, status code, destination host, exception type, story and action identifiers) so we can break metrics down by specific actions, destinations, or failure types without exploding cardinality.
connectors:
  spanmetrics:
    histogram:
      explicit:
        # Extended buckets for better P95/P99 accuracy and timeout detection
        buckets: [100ms, 250ms, 500ms, 1s, 2s, 5s, 7s, 10s, 12s, 15s, 20s, 30s, 45s, 60s]
    exemplars:
      enabled: true # Link metric samples back to example traces in Tempo
    metrics_flush_interval: 15s
    dimensions:
      # HTTP attributes
      - name: http.method
      - name: http.status_code
      # Network attributes
      - name: net.peer.name # Destination hostname
      # Error attributes
      - name: exception.type # SSL / timeout / connection error groupings
      # Tines-specific attributes
      - name: action.type # Agent type (HTTPRequestAgent, etc.)
      - name: action.id # Specific action within story
      - name: story_container.id # Container/folder ID
      # Database attributes
      - name: db.system
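Once these metrics land in Prometheus, percentile latency falls out of the histogram buckets. The query below is a sketch: the metric name assumes the spanmetrics connector's default naming plus the "tines" namespace configured in the exporter (dots in dimension names become underscores in labels), so verify the exact name on your collector's metrics endpoint before relying on it:

```promql
# P95 latency per action type over the last 5 minutes
histogram_quantile(0.95,
  sum by (le, action_type) (
    rate(tines_traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)
```

Swapping 0.95 for 0.99 gives P99, and grouping by net_peer_name instead of action_type breaks latency down by destination host.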
Exporters: Where data actually goes
Traces → Tempo
# Send traces to Tempo for storage and querying
otlp/tempo:
  endpoint: tempo:4317
  compression: gzip # Compress traces before sending
  sending_queue:
    enabled: true
    num_consumers: 4 # Parallel senders
    queue_size: 10000 # Buffer during Tempo hiccups
  tls:
    insecure: true # No TLS for internal Docker network
  timeout: 30s
  retry_on_failure:
    enabled: true
    initial_interval: 5s
    max_interval: 30s
    max_elapsed_time: 300s # Give up after 5 minutes
Metrics → Prometheus
# Expose metrics for Prometheus to scrape at :8889/metrics
prometheus:
  endpoint: "0.0.0.0:8889"
  namespace: tines # Prefix: tines_traces_span_metrics_*, tines_otelcol_*
  const_labels:
    environment: production
  resource_to_telemetry_conversion:
    enabled: true # Promote resource attrs (service.name, etc.) to metric labels
Service pipelines: wiring it all together
At the bottom of the config, everything is connected:
service:
  telemetry:
    logs:
      level: info
    metrics:
      level: basic # Enable otelcol_* self-metrics
  pipelines:
    # Traces: OTLP → mem limit → filter → tail sampling → batch → Tempo + spanmetrics
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, tail_sampling, batch]
      exporters: [otlp/tempo, spanmetrics]
    # Metrics: spanmetrics + prometheus receiver → batch → Prometheus exporter
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [prometheus]
Tempo: Trace storage
Tempo is the trace backend in this stack. The OpenTelemetry Collector sends sampled traces to Tempo over OTLP, and Tempo is responsible for storing them and making them queryable from Grafana.
This configuration does a few important things:
Tempo is configured to listen on:
gRPC on 4317
HTTP on 4318
These ports are internal to Docker. Tines never talks to Tempo directly — traces always go Tines → OTEL Collector → Tempo.
It uses local storage with WAL + block compaction
a write-ahead log (WAL) at /var/tempo/wal, the short-term buffer while traces are being ingested
immutable compressed blocks at /var/tempo/blocks, the long-term persisted storage
the compactor periodically merges and deletes old blocks
Retention is explicitly configured:
block_retention: 168h → 7 days of traces
It runs the metrics generator
This config enables Tempo’s built-in metrics generator. The generator consumes traces and produces:
service graph metrics
exemplars tied back to Tempo traces
Those metrics are sent to Prometheus via remote_write to http://prometheus:9090/api/v1/write, with send_exemplars: true.
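Put together, a minimal tempo.yaml implementing the behaviour described above might look like this sketch. Field names and the overrides layout vary between Tempo versions, so treat the values as illustrative and check them against the release you deploy:

```yaml
server:
  http_listen_port: 3200            # Query API used by Grafana

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317    # Traces arrive here from the collector
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal          # Short-term ingest buffer
    local:
      path: /var/tempo/blocks       # Long-term compressed blocks

compactor:
  compaction:
    block_retention: 168h           # Keep 7 days of traces

metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
```

The overrides block is what actually switches the metrics generator on; without it, Tempo stores traces but emits no service-graph metrics.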
Prometheus: metrics from traces
In this stack, Prometheus stores metrics that are generated from traces:
Tines sends traces to the OTEL Collector
the Collector turns those traces into metrics via spanmetrics
Prometheus scrapes those metrics and stores them
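A minimal prometheus/prometheus.yml covering that scrape might look like the following sketch (the job name is an illustrative choice):

```yaml
global:
  scrape_interval: 15s   # Matches the interval assumed by the Grafana datasource

scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]   # spanmetrics exposed by the collector
```

Because Tempo's metrics generator pushes via remote_write, Prometheus also needs its remote-write receiver enabled, which in recent versions means starting it with the --web.enable-remote-write-receiver flag.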
Grafana: datasources and dashboards
Grafana is the UI layer in this stack. It does three things for us:
connects to Tempo for traces
connects to Prometheus for metrics
optionally loads prebuilt dashboards automatically
Instead of clicking everything together in the UI, we deploy Grafana declaratively so the environment is repeatable.
Datasource provisioning
Tempo datasource
type: tempo
URL: http://tempo:3200
used for:
TraceQL search
viewing full traces
jumping from metrics → trace
It’s also wired so Tempo can pull metrics context from Prometheus for service maps and node graphs.
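A provisioning file along these lines (grafana/provisioning/datasources/tempo.yml) captures that wiring; the uid values are illustrative, but they must match between the two datasource files for the service map to resolve:

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      serviceMap:
        datasourceUid: prometheus   # Pull service-graph metrics from Prometheus
```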
Prometheus datasource
type: prometheus, marked as default
sample interval set to 15s to match the scrape interval
exemplarTraceIdDestinations tells Grafana:
exemplars in Prometheus metrics contain a trace ID
clicking them should open Tempo
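A matching grafana/provisioning/datasources/prometheus.yml might look like this sketch; the exemplar label name (trace_id) is an assumption that must match what your exporter actually attaches to exemplars:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    uid: prometheus
    isDefault: true
    jsonData:
      timeInterval: 15s             # Match the scrape interval
      exemplarTraceIdDestinations:
        - name: trace_id            # Exemplar label holding the trace ID
          datasourceUid: tempo      # Clicking an exemplar opens the trace in Tempo
```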
Dashboard provisioning
The dashboard provider configuration tells Grafana to:
look for JSON dashboards in /etc/grafana/provisioning/dashboards/json
put them in a folder called Tines
automatically reload if files change
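As a sketch, the dashboard provider configuration could look like this (the provider name and reload interval are illustrative):

```yaml
apiVersion: 1
providers:
  - name: tines-dashboards
    type: file
    folder: Tines                  # Folder shown in the Grafana UI
    allowUiUpdates: false
    updateIntervalSeconds: 30      # How often Grafana re-reads the files
    options:
      path: /etc/grafana/provisioning/dashboards/json
```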