I have a story in my Tines instance that I’ve noticed has been running erratically, throwing errors, and running slowly overall.
Based on the above problem, we should be able to answer the following:
What are the overall RED metrics for my story?
What Action(s) and Destination(s) are encountering errors or exceptions in my story?
What Action(s) are consistently slow in my story?
How can I easily see what is consuming HTTP request run times?
Overall: RED metrics
This is pretty easy to see from the overview section. We can clearly see that my story, over the last hour, has had an extremely high error rate and high average request duration.
We can see that my request volume over time has been consistent, with spikes correlating to when the story is scheduled to run.
We can also see that the story has had a consistent error rate on each run, meaning one or more actions are likely having issues. The high request duration percentiles also indicate that one or more actions are running slowly.
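To make the three RED numbers (rate, errors, duration) concrete, here is a minimal sketch of how they could be derived from a window of span data. The `Span` class and its field names are assumptions made up for illustration, not a Tines or Grafana type, and the dashboard may well calculate its percentiles differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified span record (not a Tines or Grafana type)."""
    action_id: int
    duration_ms: float
    is_error: bool


def red_metrics(spans, window_seconds):
    """Compute rate, error ratio, and duration stats for one time window."""
    if not spans:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "avg_ms": 0.0, "p95_ms": 0.0}
    durations = sorted(s.duration_ms for s in spans)
    # Nearest-rank percentile: the smallest value with at least 95% of
    # the samples at or below it.
    p95_index = math.ceil(0.95 * len(durations)) - 1
    return {
        "rate_per_s": len(spans) / window_seconds,
        "error_ratio": sum(s.is_error for s in spans) / len(spans),
        "avg_ms": sum(durations) / len(durations),
        "p95_ms": durations[p95_index],
    }
```

A story whose error ratio stays roughly the same on every run, as mine does, would show a similar `error_ratio` in each window, which is what points towards a fixed set of failing actions rather than a one-off incident.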
Actions & destinations: Errors & exceptions
Now I need to see which actions and destinations are the most problematic in my story. This trend graph over time shows that I am consistently receiving four main types of error:
4XX errors
5XX errors
SSL Errors
Timeout Errors
So which actions are the problem?
I can see that action ID 2057 has my highest error count. Clicking into the Trace link for that action, I can easily see that it is generating a lot of 504 errors.
Checking the action logs in my story confirms that this action is receiving 504 errors, so I can now investigate deeper as to why!
Similarly with destinations, I can see that self-signed.badssl.com and expired.badssl.com are receiving exceptions. Clicking into expired.badssl.com and drilling into a span shows me that the service for this destination is returning an expired certificate. Within the span attributes, we can then see that the action ID for this problematic action is 2033.
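The grouping behind this kind of view can be sketched in a few lines. This is only an illustration of bucketing failures into the four error types and counting them per action. The record keys (`action_id`, `status`, `exception`) are hypothetical names I chose for the example, not the actual span attribute names.

```python
from collections import Counter


def classify(record):
    """Map one request record to an error category, or None if it succeeded."""
    exception = (record.get("exception") or "").lower()
    if "ssl" in exception or "certificate" in exception:
        return "ssl"
    if "timeout" in exception:
        return "timeout"
    status = record.get("status")
    if status is None:
        return None
    if 400 <= status < 500:
        return "4xx"
    if 500 <= status < 600:
        return "5xx"
    return None


def errors_by_action(records):
    """Count errors per (action_id, category) pair."""
    counts = Counter()
    for record in records:
        category = classify(record)
        if category is not None:
            counts[(record["action_id"], category)] += 1
    return counts
```

Calling `errors_by_action(records).most_common(1)` then gives the single noisiest action and error type, which is the same question the dashboard answers when it surfaces the action with the highest error count.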
Actions: High latency
Now that I’ve identified all the errors in my story, I want to find which actions are slowing down my story execution. Looking at the performance analysis section shows me that my average latency is high, particularly for tines-mbp.local.
Drilling into the span for the slowest action, I can see that the majority of the time spent on action 2035’s execution was actually just the HTTP request to the service, which timed out. We can see that prior and subsequent operations (like database calls) were very quick!
HTTP requests: Where is the time spent
I’m also suspicious that a normally fast set of actions has been particularly slow. I’ve confirmed that the route and endpoint normally respond very quickly. In this case, I can use the slow tines-app DB queries in sidekiq jobs panel to check whether any requests are slow because of the database, and not the HTTP request. We can see I have a few spans that fit the criteria!
Drilling into the span, we can see the majority of the time for this HTTP POST request was actually spent inside the database, and not in the POST itself!
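The same reasoning can be written down as a small calculation: given a parent span and its child spans, work out what share of the total time each kind of operation accounts for. This is a rough sketch under my own assumptions. The `(kind, duration_ms)` tuples and the kind labels such as `"db"` and `"http"` are hypothetical, and it assumes the child spans run one after another rather than in parallel.

```python
from collections import defaultdict


def time_breakdown(parent_ms, child_spans):
    """Return the fraction of the parent span's duration spent in each child kind.

    child_spans is a list of (kind, duration_ms) tuples, e.g. ("db", 120.0).
    Assumes child spans do not overlap; overlapping children would make the
    fractions sum to more than 1.
    """
    if parent_ms <= 0:
        raise ValueError("parent_ms must be positive")
    totals = defaultdict(float)
    for kind, duration_ms in child_spans:
        totals[kind] += duration_ms
    accounted_ms = sum(totals.values())
    fractions = {kind: ms / parent_ms for kind, ms in totals.items()}
    # Time not covered by any child span (framework overhead, queuing, etc.).
    fractions["other"] = max(0.0, parent_ms - accounted_ms) / parent_ms
    return fractions
```

If the `"db"` fraction dominates, as it does in the span above, the database is the place to look first, even though the request shows up as a slow HTTP POST.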
Continuing on from the example in this article, you can find the configurations for the observability stack and Grafana dashboard in the .zip file below: