
Observability At-Scale #174

@maxsmythe

If we are load-testing a system, we should also be able to validate that the system is functioning appropriately. Observing metrics from the load source is well understood; observing the behavior of Substrate is not:

  • Actors can be suspended, which makes Prometheus-style scraping a problem (push-based metrics may help here if we flush them before sleep, but that interferes with batching/sampling)
  • How do we attribute resource consumption to an individual actor? Do we need to?
  • Can most benchmarking metrics be satisfied via scraping atelets/API server to validate things like # of successful suspend/resume?
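
To make the flush-before-sleep idea concrete, here is a minimal sketch of a push-based counter buffer. Everything in it is an assumption for illustration: `MetricsBuffer`, the `sink` callable, and the `on_suspend` hook are hypothetical names, and Substrate may not expose a pre-suspend hook in this form. It also shows the tension with batching: a suspend forces a flush of a partially filled batch.

```python
from collections import Counter
from typing import Callable, Dict


class MetricsBuffer:
    """Batches counter increments in memory and pushes them to a sink.

    `sink` is a hypothetical callable (actor_id, counts) standing in for
    whatever push endpoint would actually receive the metrics.
    """

    def __init__(
        self,
        actor_id: str,
        sink: Callable[[str, Dict[str, int]], None],
        batch_size: int = 100,
    ) -> None:
        self.actor_id = actor_id
        self.sink = sink
        self.batch_size = batch_size
        self._counts: Counter = Counter()
        self._pending = 0  # increments recorded since the last flush

    def inc(self, name: str, value: int = 1) -> None:
        """Record an increment; push once the batch is full."""
        self._counts[name] += value
        self._pending += 1
        if self._pending >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Push buffered counts to the sink; do nothing if the buffer is empty."""
        if not self._counts:
            return
        self.sink(self.actor_id, dict(self._counts))
        self._counts.clear()
        self._pending = 0

    def on_suspend(self) -> None:
        """Assumed pre-suspend hook: flush a partial batch so nothing is
        stranded in memory while the actor sleeps."""
        self.flush()
```

With this shape, an actor that records two increments and is then suspended pushes one small batch instead of waiting for `batch_size` increments, which is exactly where the cost to batching/sampling shows up.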

Logs are already persistent. Are there vital metrics we should push out to logs? Do we want an easy way of re-aggregating these metrics?
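
One possible shape for re-aggregation, as a rough sketch: if metrics were written as JSON log lines, totals per actor could be rebuilt with a single pass over the log. The record layout here (`kind`, `actor`, `metric`, `value`) is an assumption made up for this example, not an existing Substrate log format.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def aggregate_log_metrics(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Sum metric values per actor from JSON log lines.

    Assumed (hypothetical) record shape:
      {"kind": "metric", "actor": "...", "metric": "...", "value": 1}
    Lines that are not JSON, or are not metric records, are skipped.
    """
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ordinary free-text log line
        if not isinstance(record, dict) or record.get("kind") != "metric":
            continue
        try:
            actor = record["actor"]
            metric = record["metric"]
            value = float(record["value"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed metric record
        if not isinstance(actor, str) or not isinstance(metric, str):
            continue
        totals[actor][metric] += value
    # Convert nested defaultdicts to plain dicts for callers.
    return {actor: dict(metrics) for actor, metrics in totals.items()}
```

Because the function only sums, it suits counters (bytes written, error counts); latency distributions would need a different record, such as histogram buckets.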

Example uses:

  • Validate total # of bytes written/overwritten (or FDs opened or network requests, etc.)
  • Aggregate non-fatal error counts (e.g. a long-running process attempts to poll an endpoint but times out)
  • Observe latency of requests as seen by actor
  • Aggregate request statistics when agent is acting as a server
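
As an illustration of the latency and non-fatal error cases above, the following sketch records request latency as seen by the actor and counts timeouts without letting them abort the process. `RequestStats` and its methods are hypothetical names, and treating `TimeoutError` as the only non-fatal error is an assumption for the example.

```python
import math
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List


class RequestStats:
    """Collects per-request latency and a non-fatal timeout count."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.latencies: List[float] = []
        self.timeouts = 0
        self._clock = clock

    @contextmanager
    def observe(self) -> Iterator[None]:
        """Time the wrapped request; count and swallow TimeoutError."""
        start = self._clock()
        try:
            yield
        except TimeoutError:
            self.timeouts += 1  # non-fatal: recorded, not re-raised
        finally:
            self.latencies.append(self._clock() - start)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile of recorded latencies, for 0 < p <= 100."""
        if not self.latencies:
            raise ValueError("no latencies recorded")
        ordered = sorted(self.latencies)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]
```

A caller would wrap each outbound request in `with stats.observe():` and report `stats.percentile(99)` and `stats.timeouts` at flush time. Keeping raw latencies is simple but unbounded; a long-running actor would probably want fixed histogram buckets instead.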

Metadata

Labels: area/dev-infra, prio/P0 (Highest priority / required for next milestone)
