Designing a story troubleshooting dashboard
The goal of the Tines Story Troubleshooting Dashboard is simple:
Make it obvious which parts of a story are slow, erroring, or noisy.
This dashboard is designed to troubleshoot HTTP-centric Tines stories. It combines span-derived metrics from Prometheus with raw traces from Tempo. You can filter to a single story, drill down into specific destinations or actions, and jump directly from panels into full traces. It focuses on answering three questions: what is slow, what is failing, and does the issue live in Tines execution, a particular action, or an external destination?
The dashboard is organized around three ideas:
Story-level health
  - high-level RED metrics (total requests, errors, duration percentiles)
  - a quick read on “is this story generally healthy?”
Action-level performance
  - latency by action
  - error rate by action
  - outlier actions that only occasionally spike
Drill-down paths
  - click → jump directly into Tempo traces
Breaking down the dashboard
1. Overview
The Overview section gives a fast, high-level read on the health of a single Tines story. It surfaces core RED metrics such as request rate, error volume, error percentage, average duration, and P99 latency. The intent is simple: tell you, within a few seconds, whether the story is generally healthy or exhibiting failure or latency symptoms.
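As a rough sketch, the Overview stats can be driven by PromQL like the following. It assumes span metrics are exported (for example by Tempo's metrics-generator or the OTel span-metrics connector) as `traces_spanmetrics_calls_total` and `traces_spanmetrics_latency_bucket`, and that a `story` label has been configured as a span-metrics dimension; the exact metric and label names depend on your pipeline configuration:

```promql
# Request rate for the selected story
sum(rate(traces_spanmetrics_calls_total{story="$story"}[$__rate_interval]))

# Error percentage: errored calls over all calls
100 *
  sum(rate(traces_spanmetrics_calls_total{story="$story", status_code="STATUS_CODE_ERROR"}[$__rate_interval]))
/
  sum(rate(traces_spanmetrics_calls_total{story="$story"}[$__rate_interval]))

# P99 latency from the span-duration histogram
histogram_quantile(0.99,
  sum by (le) (rate(traces_spanmetrics_latency_bucket{story="$story"}[$__rate_interval])))
```

`$story` here is a Grafana dashboard variable, and `$__rate_interval` is Grafana's built-in range variable for rate queries.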
2. Trends over time
This section focuses on how behavior changes across time, not just point-in-time snapshots. You’ll see throughput, success and error rates, and latency percentiles plotted across the selected time window. This makes it easy to spot patterns such as gradual performance decay, sudden regressions after a deploy, or spikes that correlate with increased traffic. It also helps distinguish sustained latency problems from rare tail events.
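For the latency-trend panels, one common pattern is to plot several percentiles from the same histogram over the selected window. This sketch reuses the assumed `traces_spanmetrics_latency_bucket` metric and `story` label from above:

```promql
# P50 / P95 / P99 latency over time (one query per series)
histogram_quantile(0.50,
  sum by (le) (rate(traces_spanmetrics_latency_bucket{story="$story"}[$__rate_interval])))

histogram_quantile(0.95,
  sum by (le) (rate(traces_spanmetrics_latency_bucket{story="$story"}[$__rate_interval])))

histogram_quantile(0.99,
  sum by (le) (rate(traces_spanmetrics_latency_bucket{story="$story"}[$__rate_interval])))
```

Plotting P50 alongside P99 is what lets you distinguish sustained latency problems (all percentiles shift) from rare tail events (only P99 moves).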
3. Error analysis
The Error Analysis section is designed to answer a single question: what is failing, and where? Errors are broken down by HTTP status class, action, and destination. This helps isolate whether failures are concentrated in one integration, one destination, or one specific action. From here, the dashboard also exposes direct trace pivot links, letting you jump from “error count increased here” into the exact traces responsible for those failures.
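A status-class breakdown can be sketched with `label_replace`, assuming the span metrics carry an `http_status_code` dimension (again, a configuration-dependent assumption):

```promql
# Error rate grouped into 4xx / 5xx status classes
sum by (status_class) (
  label_replace(
    rate(traces_spanmetrics_calls_total{story="$story", http_status_code=~"[45].."}[$__rate_interval]),
    "status_class", "${1}xx", "http_status_code", "([45]).."))
```

The trace pivot links themselves are typically wired up as Grafana data links (or exemplars) on these panels, pointing at a Tempo query scoped to the same story and time range.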
4. Performance analysis
This section shifts the focus from failure to slowness. It highlights the actions and destinations contributing the most to overall latency using percentile-based views such as P95 and P99. You can see whether performance degradation is systemic or isolated to a handful of expensive operations. It also surfaces slow-request samples, allowing you to hop straight into slow traces rather than hunting manually in Tempo.
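A per-action slowness view can be sketched by grouping the same latency histogram by span name and taking the top offenders, under the same metric-name assumptions as above:

```promql
# Top 5 actions by P95 latency
topk(5,
  histogram_quantile(0.95,
    sum by (le, span_name) (
      rate(traces_spanmetrics_latency_bucket{story="$story"}[$__rate_interval]))))
```

Comparing this against the aggregate percentiles tells you whether degradation is systemic or isolated to a handful of expensive operations.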
5. Volume & activity
The Volume & Activity panels help explain load characteristics of a story. They show which actions generate the most requests and how frequently specific destinations are called. This is particularly useful for identifying noisy polling actions, unexpected loops, or integrations that are being exercised far more than expected.
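The volume panels reduce to simple rate-by-label queries. This sketch assumes a `destination` dimension on the span metrics, which is an assumption you would configure yourself:

```promql
# Top 10 actions by request volume
topk(10,
  sum by (span_name) (rate(traces_spanmetrics_calls_total{story="$story"}[$__rate_interval])))

# Request rate per destination
sum by (destination) (rate(traces_spanmetrics_calls_total{story="$story"}[$__rate_interval]))
```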
6. Raw events & DB slow query panels
This section pulls in recent trace samples, not just aggregated metrics. You’ll see tables of the latest erroring requests, slow requests based on your defined threshold, and relevant database-related spans. These provide direct, click-through entry points into Tempo, allowing you to move straight from metric-level symptoms into span-level root cause investigation without guesswork.
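The raw-event tables can be backed by TraceQL queries against Tempo. The three sketches below fetch, in order: recent erroring spans, spans slower than an example 2s threshold, and database-related spans. The `resource.story` attribute name is an assumption; use whatever attribute your spans actually carry for the story identifier:

```traceql
{ resource.story = "$story" && status = error }

{ resource.story = "$story" && duration > 2s }

{ resource.story = "$story" && span.db.system != "" }
```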
Continuing from the example in this article, you can find the configuration files for the observability stack and the Grafana dashboard in the .zip file below: