Your logs are lying to you - metrics are meaner and better.
Everyone loves logs… until the incident postmortem reads like bad fan fiction.
Most teams start with expensive log aggregation, full-text searching their way into oblivion. So much noise. So little signal. And still, no clue what actually happened. Why? Because writing meaningful logs is a lost art.
Logs are like candles: nice for mood lighting, useless in a house fire.
If you need traces to understand your system, congratulations: you're already in hell.
Let me introduce my favourite method: real-time, metric-driven user simulation aka "Overwatch".
Here's how you do it:
Set up a service that runs real end-to-end user workflows 24/7. Use Cypress, Playwright, Selenium… your poison of choice.
Every action emits a timed metric tagged with the user workflow and the action, as in the sketch below.
Now you know exactly what a user did before everything went up in flames.
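Here's a minimal sketch of such a runner, assuming Playwright for the browser work and InfluxDB's HTTP write API (line protocol) for the metrics. The InfluxDB URL, token, bucket, the selectors, and the login workflow itself are placeholders, not a prescription:

```typescript
// Overwatch-style runner: replay a user workflow 24/7, emit one timed metric per action.
// Requires Node 18+ (built-in fetch) and `npm i playwright`.
import { chromium, Page } from 'playwright';

// Placeholder InfluxDB v2 write endpoint and credentials; swap in your own.
const INFLUX_URL =
  'http://influxdb:8086/api/v2/write?org=myorg&bucket=overwatch&precision=ms';
const INFLUX_TOKEN = process.env.INFLUX_TOKEN ?? '';

// One data point per action, tagged with workflow, action, and pass/fail (line protocol).
async function recordMetric(workflow: string, action: string, durationMs: number, ok: boolean) {
  const line = `user_action,workflow=${workflow},action=${action},ok=${ok} duration_ms=${durationMs} ${Date.now()}`;
  await fetch(INFLUX_URL, {
    method: 'POST',
    headers: { Authorization: `Token ${INFLUX_TOKEN}` },
    body: line,
  });
}

// Time one user action and record it, whether it succeeds or blows up.
async function timed(workflow: string, action: string, fn: () => Promise<unknown>) {
  const start = Date.now();
  try {
    await fn();
    await recordMetric(workflow, action, Date.now() - start, true);
  } catch (err) {
    await recordMetric(workflow, action, Date.now() - start, false);
    throw err;
  }
}

// One end-to-end workflow: log in and wait for the dashboard. Selectors are made up.
async function loginWorkflow(page: Page) {
  await timed('login', 'open_login_page', () => page.goto('https://app.example.com/login'));
  await timed('login', 'submit_credentials', async () => {
    await page.fill('#email', 'overwatch@example.com');
    await page.fill('#password', process.env.OVERWATCH_PASSWORD ?? '');
    await page.click('button[type=submit]');
    await page.waitForSelector('#dashboard');
  });
}

// The 24/7 part: run the workflow in a fresh browser, forever, once a minute.
async function main() {
  while (true) {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    try {
      await loginWorkflow(page);
    } catch (err) {
      console.error('workflow failed', err);
    } finally {
      await browser.close();
    }
    await new Promise((r) => setTimeout(r, 60_000));
  }
}

main().catch(console.error);
```

One workflow, a handful of actions, and every run leaves a trail of numbers instead of prose.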
Use Grafana + InfluxDB (or whatever tools you already run) to build dashboards that actually tell stories:
* How fast are user workflows?
* Which steps are breaking, and how often?
* What's slower today than yesterday?
* Who's affected, and where?
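What feeds those panels? Queries like this one. A minimal sketch, assuming InfluxDB 2.x and the schema from the runner above (bucket `overwatch`, measurement `user_action`): mean step duration per workflow and action over the last day, posted to the query API the way a Grafana panel would:

```typescript
// Sketch: the query behind a "how fast are user workflows?" panel.
// Bucket, org, and measurement names mirror the runner sketch above; adjust to your setup.
const FLUX_QUERY = `
from(bucket: "overwatch")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "user_action" and r._field == "duration_ms")
  |> group(columns: ["workflow", "action"])
  |> aggregateWindow(every: 15m, fn: mean)
`;

// Run it against the InfluxDB 2.x query API (Grafana does the same under the hood).
async function queryWorkflowSpeed(): Promise<string> {
  const res = await fetch('http://influxdb:8086/api/v2/query?org=myorg', {
    method: 'POST',
    headers: {
      Authorization: `Token ${process.env.INFLUX_TOKEN ?? ''}`,
      'Content-Type': 'application/vnd.flux',
      Accept: 'application/csv',
    },
    body: FLUX_QUERY,
  });
  return res.text(); // annotated CSV: one mean duration per workflow/action per 15m window
}
```

Swap `mean` for a failure count or a yesterday-vs-today comparison and you've answered the rest of the list.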
Alerts now mean something.
Incidents become surgical strikes, not scavenger hunts.
Bonus: run the same system against every test environment and catch regressions before deployment. Build it to be reusable and the same service doubles as a load-test driver (sketch below).
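A rough sketch of that reuse, assuming the runner exposes its workflows as plain async functions (like `loginWorkflow` above): wrap one in N parallel virtual users and let the existing metrics tell you what broke under load.

```typescript
// Sketch: reuse a workflow function as a crude load test.
// Every action inside it still emits its timed, tagged metric,
// so the same dashboards show the impact live.
async function loadTest(runWorkflow: () => Promise<void>, virtualUsers: number, iterations: number) {
  const user = async () => {
    for (let i = 0; i < iterations; i++) {
      await runWorkflow();
    }
  };
  await Promise.all(Array.from({ length: virtualUsers }, user));
}

// Example: 25 parallel simulated users, 10 runs each (names are illustrative).
// await loadTest(runLoginOnce, 25, 10);
```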
No need to buy overpriced tools. Build one small service, like the ones you already build, except this one might save your soul.
And yes, transform logs into metrics where possible. Hash the PII and move on.
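For the log-to-metric part, a minimal sketch using Node's crypto; the event shape, measurement name, and salt handling are assumptions, not doctrine:

```typescript
// Sketch: turn a structured log event into a metric, hashing the PII on the way.
import { createHash } from 'node:crypto';

// Salted one-way hash: still groupable in dashboards, no longer readable.
const PII_SALT = process.env.PII_SALT ?? 'rotate-me';

function hashPII(value: string): string {
  return createHash('sha256').update(PII_SALT + value).digest('hex').slice(0, 16);
}

// Instead of logging "user jane@example.com checked out in 1432ms",
// emit a tagged data point with the email replaced by its hash.
function logToMetric(event: { email: string; action: string; durationMs: number }): string {
  return `user_event,action=${event.action},user=${hashPII(event.email)} duration_ms=${event.durationMs} ${Date.now()}`;
}

console.log(logToMetric({ email: 'jane@example.com', action: 'checkout', durationMs: 1432 }));
```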
Stop guessing. Start observing.
Metrics > Logs. Always.