Comment by dangoodmanUT

dangoodmanUT Mar 9, 2024

> we were getting to the point where it took a full-time engineer to build an observability layer around Temporal

We did it in like 5 minutes by adding in otel traces? And maybe another 15 to add their grafana dashboard?

What obstacles did you experience here?

abelanger Mar 9, 2024

Well, for one - most otel services (like Honeycomb) are designed around aggregate views, and engineers found it difficult to track down the failure of specific workflows. We were already using Sentry, had started adding prom + grafana into our stack, and were already using mezmo for logging. So to debug a workflow, we'd see an alert come in through Sentry, grab the workflow ID and activity ID, perform a search in the Temporal console, track down the failed activity (of which there could be anywhere from 1 to 100), and associate that with our logs in mezmo (involving a new query syntax). That's a lot of raw data to parse before you can figure out what's going wrong. And then we wanted to build out a view of worker health, which involves a new set of dashboards and alerts that are different from our error alerting in Sentry.

Yes, this sounded broken to us too - we were aware of the promise of consolidation with an OpenTelemetry and Grafana stack, but we couldn't make this transition happen cleanly, and when you're already relying on certain tools for your API, it makes the transition more difficult. There's also upskilling involved in getting engineers on the team to adjust to otel when they're used to more intuitive tools like sentry and mezmo.

A good set of default metrics, better search, and views for worker performance and pools - that would have gone a long way. The Temporal UI's features amount to a basic recent workflows list, an expanded workflow view with stack traces for thrown errors, a schedules page, and a settings page.
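
To make the debugging loop from the first paragraph concrete (alert, workflow ID, failed activity, matching logs), here is a minimal sketch of that correlation step in plain Python. Everything in it is an assumption for illustration: `ActivityEvent`, `LogLine`, `failed_activity_logs`, and the sample records are hypothetical and are not the Temporal, Sentry, or mezmo APIs. It only shows the join that had to be done by hand across tools.

```python
from dataclasses import dataclass


@dataclass
class ActivityEvent:
    # Hypothetical record, standing in for one row of a workflow's history.
    workflow_id: str
    activity_id: str
    status: str  # "completed" or "failed"


@dataclass
class LogLine:
    # Hypothetical record, standing in for one line from a logging backend.
    workflow_id: str
    activity_id: str
    message: str


def failed_activity_logs(workflow_id, events, logs):
    """Return {activity_id: [messages]} for the failed activities of one workflow."""
    # Step 1: find which activities of this workflow failed.
    failed = {
        e.activity_id
        for e in events
        if e.workflow_id == workflow_id and e.status == "failed"
    }
    # Step 2: collect the log lines that belong to those activities.
    result = {activity_id: [] for activity_id in failed}
    for line in logs:
        if line.workflow_id == workflow_id and line.activity_id in failed:
            result[line.activity_id].append(line.message)
    return result


# Made-up sample data: one workflow with one failed activity.
events = [
    ActivityEvent("wf-1", "act-1", "completed"),
    ActivityEvent("wf-1", "act-2", "failed"),
    ActivityEvent("wf-2", "act-1", "failed"),
]
logs = [
    LogLine("wf-1", "act-1", "charged card"),
    LogLine("wf-1", "act-2", "timeout calling billing API"),
    LogLine("wf-2", "act-1", "unrelated failure"),
]

print(failed_activity_logs("wf-1", events, logs))
```

In practice each of those two steps lived in a different tool with its own query syntax, which is where the time went.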
