Observability At-Scale

If we are load-testing a system, we should also be able to validate that the system is functioning correctly. Observing metrics from the load source is well understood; observing the behavior of Substrate is not:

- Actors can be suspended, which makes Prometheus-style scraping a problem (push-based metrics may help here if we flush before sleep, but that interferes with batching/sampling).
- How do we attribute resource consumption to an individual actor? Do we need to?
- Can most benchmarking metrics be satisfied by scraping atelets/the API server, e.g. to validate things like # of successful suspend/resume?
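To make the push-based option in the first bullet concrete, here is a minimal sketch of an in-memory counter buffer that is flushed right before an actor sleeps. Everything in it is hypothetical: `MetricBuffer`, the `on_suspend` hook, and the `sink` callable are illustrative names, not existing Substrate APIs, and the sketch assumes the runtime can call a hook before suspension.

```python
from collections import Counter
from typing import Callable, Dict


class MetricBuffer:
    """Hypothetical per-actor buffer: batches counters, pushes them on flush."""

    def __init__(self, actor_id: str, sink: Callable[[str, Dict[str, int]], None]):
        self.actor_id = actor_id
        self._sink = sink          # assumed push target (collector, log, queue, ...)
        self._counts = Counter()

    def incr(self, name: str, value: int = 1) -> None:
        # Batched in memory; nothing leaves the actor until flush().
        self._counts[name] += value

    def flush(self) -> None:
        # Push the current batch, tagged with the actor ID for attribution.
        if not self._counts:
            return
        self._sink(self.actor_id, dict(self._counts))
        self._counts.clear()

    def on_suspend(self) -> None:
        # Assumed hook: invoked by the runtime just before the actor sleeps,
        # so no buffered metrics are stranded while the actor is suspended.
        self.flush()


# Usage: collect pushes in a list instead of a real collector.
pushed = []
buf = MetricBuffer("actor-1", lambda actor, counts: pushed.append((actor, counts)))
buf.incr("bytes_written", 4096)
buf.incr("bytes_written", 1024)
buf.on_suspend()
```

The trade-off mentioned above shows up directly: a flush forced by suspension cuts the batch short, so batch sizes and sampling windows are no longer under the metric pipeline's control.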

Logs are already persistent. Are there vital metrics we should push out to logs? Do we want an easy way of re-aggregating these metrics?
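One way the log-based approach could look, as a sketch only: metrics are written as tagged JSON log lines and summed back up from the log stream later. The `METRIC ` prefix, the record fields, and both function names are assumptions made for illustration, not an existing format.

```python
import json
from collections import defaultdict

METRIC_PREFIX = "METRIC "  # hypothetical marker that separates metric lines from ordinary logs


def format_metric_line(actor_id, name, value):
    """Render one metric sample as a single log line."""
    record = {"actor": actor_id, "name": name, "value": value}
    return METRIC_PREFIX + json.dumps(record, sort_keys=True)


def aggregate(lines):
    """Sum metric values per (actor, name); skip non-metric or malformed lines."""
    totals = defaultdict(float)
    for line in lines:
        if not line.startswith(METRIC_PREFIX):
            continue
        try:
            record = json.loads(line[len(METRIC_PREFIX):])
            totals[(record["actor"], record["name"])] += record["value"]
        except (ValueError, KeyError, TypeError):
            continue  # tolerate truncated or corrupted lines
    return dict(totals)


# Usage: mix metric lines with ordinary log output.
log = [
    format_metric_line("a1", "bytes_written", 10),
    "actor a1 resumed",
    format_metric_line("a1", "bytes_written", 5),
    format_metric_line("a2", "poll_timeouts", 1),
]
totals = aggregate(log)
```

Because samples survive in the log regardless of whether the actor is running, aggregation can happen after the fact, at the cost of parsing log volume that a dedicated metrics pipeline would avoid.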

Example uses:
- Validate total # of bytes written/overwritten (or FDs opened or network requests, etc.)
- Aggregate non-fatal error counts (e.g. a long-running process attempts to poll an endpoint but times out)
- Observe latency of requests as seen by actor
- Aggregate request statistics when the agent is acting as a server
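The latency and non-fatal-error cases above could be captured inside the actor with something like the following sketch. `RequestStats` and its methods are hypothetical names; treating `TimeoutError` as the non-fatal error and using a nearest-rank percentile are illustrative choices, not a proposal for the final design.

```python
import math
import time
from contextlib import contextmanager


class RequestStats:
    """Hypothetical in-actor recorder for request latency and non-fatal errors."""

    def __init__(self):
        self.latencies = []  # seconds, as observed by the actor
        self.errors = 0      # non-fatal failures (here: timeouts)

    def observe(self, seconds, ok=True):
        self.latencies.append(seconds)
        if not ok:
            self.errors += 1

    @contextmanager
    def timed(self):
        # Times the wrapped block; a TimeoutError is counted and suppressed.
        start = time.monotonic()
        ok = True
        try:
            yield
        except TimeoutError:
            ok = False
        finally:
            self.observe(time.monotonic() - start, ok)

    def percentile(self, p):
        # Nearest-rank percentile over all recorded latencies.
        if not self.latencies:
            raise ValueError("no samples recorded")
        ordered = sorted(self.latencies)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


# Usage: wrap a polling call that may time out.
stats = RequestStats()
with stats.timed():
    raise TimeoutError("endpoint did not respond")  # stands in for a real request
```

Keeping raw samples is the simplest option for a sketch; a long-running actor would more likely use fixed histogram buckets so memory stays bounded.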


Observability At-Scale #174
