Configuration
Overview
Grafana Alloy can optionally be enabled on all switches to forward metrics and logs to the configured targets using the Prometheus Remote-Write API and the Loki API. Metrics include port speeds, counters, errors, operational status, transceivers, fans, power supplies, temperature sensors, BGP neighbors, LLDP neighbors, and more. Logs include Hedgehog agent logs, switch syslog, and pod logs for all services running in the control node Kubernetes cluster. Modify the target URL as needed: depending on the data provider, the path may be /api/v1/write instead of /api/v1/push; check the provider's documentation.
Switches do not have direct access to the Internet; they push telemetry data through a proxy running in a pod on the control node. Configure the control node so that it can resolve and reach the Prometheus and Loki servers.
Telemetry can be enabled after installation of the fabric. There are two YAML objects that control the telemetry configuration. The first YAML object configures the credentials and URL for the collectors. The second configures which metrics are sent via Grafana Alloy.
Grafana Cloud Configuration
Tokens
Grafana Cloud manages read and write permissions with policies. To send metrics
to Prometheus or logs to Loki, create a policy for your realm. When creating the
policy, ensure that at least the logs:write and metrics:write permissions are
selected. After the policy is created, create a token under that policy. Ensure
that the token is appropriately named and time limited. Depending on your
environment, separate tokens for log writing and metric writing might be
advisable. For additional details, see the documentation.
Billing
Use regular expressions to limit the amount of data sent to Grafana Cloud, as costs compound quickly.
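As a rough illustration of how a regular expression narrows the set of metrics, the following Python sketch emulates a keep-style filter over metric names. It assumes Prometheus-style relabeling semantics, where the expression must match the whole metric name. The pattern and most of the sample names are hypothetical, and Alloy evaluates Go RE2 syntax rather than Python's `re` module, so treat this only as a way to reason about a pattern before applying it.

```python
import re


def keep_matching(metric_names, pattern):
    """Return the names a keep-style rule with this pattern would retain.

    Assumption: the regex is anchored at both ends, as in Prometheus
    relabeling, so re.fullmatch is used instead of re.search.
    """
    compiled = re.compile(pattern)
    return [name for name in metric_names if compiled.fullmatch(name)]


# Sample metric names for illustration only; the real set depends on
# which collectors are enabled.
sample_names = [
    "fabric_agent_agent_heartbeats_total",
    "node_cpu_seconds_total",
    "node_memory_MemFree_bytes",
    "node_filesystem_avail_bytes",
]

# Hypothetical pattern: keep agent metrics and CPU metrics, drop the rest.
kept = keep_matching(sample_names, r"fabric_agent_.*|node_cpu_.*")
print(kept)

# Because of anchoring, a bare prefix without ".*" matches nothing here.
print(keep_matching(sample_names, r"node_cpu"))
```

The second call shows a common pitfall: with anchored matching, a pattern such as `node_cpu` selects only a metric with exactly that name, so a trailing `.*` is needed to match a family of metrics.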
Add Credentials
Take the tokens created in Grafana Cloud and populate them in this YAML file. The username differs between Prometheus and Loki. Apply the settings below so that telemetry is pushed to the specified Prometheus and Loki instances:
- Common label for all targets, "env" is a well-known label used in dashboards
- Can be any name of your choosing
- Extra labels applied to a specific target
- Can be any name of your choosing
To apply these changes to the fabric use the following command:
Gateway Observability
gateway.yaml
- Alloy is configured to use the prometheus.exporter.unix component
- This lists the enabled collectors
To apply these changes to the fabric use the following command:
Fabric Observability
This example shows how to configure the collection of data from the fabric switches.
- The Hedgehog agent generates information from the ASIC ports and switch configuration
- This option mirrors the prometheus.relabel component
- This is a regular expression over all the metric names that come from Prometheus
- Alloy is configured to use the prometheus.exporter.unix component
- This lists the enabled collectors
- This option mirrors the prometheus.relabel component
- Regular expression over the metric names that come from the metricsCollectors
To apply these changes to the fabric use the following command:
Users are encouraged to read the Grafana Alloy Docs on relabeling to ensure the desired metrics are selected. By default all metrics are sent to the collectors.
Alerting
The alert rule queries the increase of the
fabric_agent_agent_heartbeats_total metric. In normal operation, the switch
agent sends two increments every minute. The Prometheus increase function
extrapolates the value to the total time range, so the reported number can be
higher than what was actually observed; this is not a concern. Select a value
for the Alert condition according to your operational needs. The example uses a
value of 3, which allows for some delays and drops before firing the alarm.
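The extrapolation effect can be reasoned about with a simplified model. The Python sketch below scales the increase observed between the first and last sample to the full query window. This is an approximation for illustration only: the real Prometheus function applies additional limits on how far it extrapolates and handles counter resets, and the sample timestamps, counter values, and 60-second window used here are hypothetical.

```python
def extrapolated_increase(samples, window_seconds):
    """Simplified model of increase(): scale the observed delta to the window.

    samples is a list of (timestamp_seconds, counter_value) pairs, oldest
    first. Counter resets and Prometheus' extrapolation limits are ignored.
    """
    if len(samples) < 2:
        return None  # not enough data points to compute an increase
    first_ts, first_val = samples[0]
    last_ts, last_val = samples[-1]
    covered = last_ts - first_ts
    if covered <= 0:
        return None  # samples must span a positive time interval
    observed = last_val - first_val
    return observed * window_seconds / covered


# Hypothetical samples inside a 60-second window: the counter goes from
# 10 to 12, so two heartbeats were actually observed over 45 seconds.
samples = [(7, 10), (22, 11), (37, 11), (52, 12)]

reported = extrapolated_increase(samples, 60)
print(round(reported, 2))  # 2.67, versus 2 actually observed
```

In this model the two observed increments are reported as roughly 2.67, which is why the reported number can exceed the actual count of heartbeats even though no extra heartbeats occurred.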
For convenience here is the JSON used to configure this alarm. Values that should be changed to match your environment contain the string "Hedgehog".
Grafana has a learning journey to assist users in creating and configuring alerts.
