If we are load-testing a system, we should also be able to validate that the system is functioning appropriately. Observing metrics from the load source is well understood; observing the behavior of Substrate is not:
- Actors can be suspended, which makes Prometheus-style scraping a problem (push-based reporting may help here if we flush metrics before sleep, but that interferes with batching/sampling)
- How do we attribute resource consumption to an individual actor? Do we need to?
- Can most benchmarking metrics be satisfied by scraping atelets/API server, e.g. to validate things like # of successful suspends/resumes?
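To make the push-based option concrete, here is a minimal sketch of an in-actor buffer that batches counters and flushes them in one record before the actor sleeps. Everything in it is an assumption for illustration: `ActorMetrics`, the `on_suspend` hook, and the `sink` callable are hypothetical names, not an existing Substrate API, and the real suspend notification may look quite different.

```python
import json
import time
from collections import defaultdict
from typing import Callable, Dict


class ActorMetrics:
    """Buffers counters in memory and pushes them out as one batch."""

    def __init__(self, actor_id: str, sink: Callable[[str], None]) -> None:
        self.actor_id = actor_id
        # Hypothetical sink: could be a log writer or a push-gateway client.
        self.sink = sink
        self.counters: Dict[str, float] = defaultdict(float)

    def incr(self, name: str, value: float = 1.0) -> None:
        """Accumulate locally; nothing leaves the actor until flush()."""
        self.counters[name] += value

    def flush(self, reason: str) -> None:
        """Emit all buffered counters as a single JSON record, then reset."""
        if not self.counters:
            return  # nothing buffered, so emit nothing
        record = {
            "actor": self.actor_id,
            "reason": reason,
            "ts": time.time(),
            "counters": dict(self.counters),
        }
        self.sink(json.dumps(record, sort_keys=True))
        self.counters.clear()

    def on_suspend(self) -> None:
        """Hypothetical hook, assumed to run just before the actor sleeps."""
        self.flush(reason="suspend")
```

The trade-off noted above still applies: a forced flush on every suspend cuts batches short, so an actor that suspends often emits many small records instead of a few large ones.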
Logs are already persistent; are there vital metrics we should push out to logs? Do we want an easy way of re-aggregating these metrics?
Example uses:
- Validate total # of bytes written/overwritten (or FDs opened or network requests, etc.)
- Aggregate non-fatal error counts (e.g. a long-running process attempts to poll an endpoint but times out)
- Observe latency of requests as seen by actor
- Aggregate request statistics when agent is acting as a server
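As a rough illustration of re-aggregating metrics from logs, the sketch below sums counters and collects latency samples from JSON log lines. The line format (`actor`, `metric`, `value`, and an optional `kind` of `"sample"`) is an assumed convention invented for this example, as are the function names; lines that do not match it are skipped as ordinary log output.

```python
import json
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def reaggregate(lines: Iterable[str]) -> Tuple[Dict[str, float], Dict[str, List[float]]]:
    """Return (summed counters, sorted samples) keyed by metric name."""
    totals: Dict[str, float] = defaultdict(float)
    samples: Dict[str, List[float]] = defaultdict(list)
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a metric record, just a regular log line
        if not isinstance(rec, dict) or "metric" not in rec or "value" not in rec:
            continue
        if rec.get("kind") == "sample":
            # Samples (e.g. latencies) are kept individually for percentiles.
            samples[rec["metric"]].append(float(rec["value"]))
        else:
            # Everything else is treated as a counter and summed across actors.
            totals[rec["metric"]] += float(rec["value"])
    return dict(totals), {name: sorted(vals) for name, vals in samples.items()}


def percentile(sorted_values: List[float], q: float) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]
```

With records like `{"actor": "a1", "metric": "bytes_written", "value": 4096}`, the summed totals cover the byte, FD, and error-count cases, while sample records cover request latency as seen by the actor.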