Node Exporter is an open-source Prometheus exporter that collects and exposes hardware and OS metrics such as CPU, System Load, RAM, Network Traffic, and Disk. It acts as a layer between the Prometheus server and the system in which the application is hosted, collecting metrics from that system. Node exporter metrics can then be used for several purposes, such as visual dashboarding, real-time monitoring, and alerting.

Figure 1: Node Exporter metric example in Prometheus Graphical User Interface

At La Redoute, we use Grafana connected to a Prometheus data source for the visual dashboards of node exporter metrics. There is no need to create dashboards from scratch to get node exporter metrics into Grafana; it is easy to find Grafana templates that anyone can use and adapt. At La Redoute, we started with two templates.

1. A Basic template, used to monitor the health and performance of the VMs of our applications, which is displayed on our TV screens.

Figure 2: Representation of a Basic Grafana dashboard with node exporter metrics

2. An Advanced template, used to go deeper into the analysis when an issue is raised about the application system's performance or health.

Figure 3: Representation of an Advanced Grafana dashboard with node exporter metrics

Node Exporter was introduced as one of the monitoring tools for the VMs of our on-premises business applications. In the past, there was a specific event after which we realized that, had we been monitoring node exporter metrics, the issue could have been prevented and the business impact mitigated.

Node exporter metrics form patterns over time that allow us to determine whether something is going wrong in the system. If a metric diverges from its pattern, it is a signal that system health is degrading. With real-time alerting, this gives us time to analyze, find, and fix the issue before any business impact, such as performance degradation or an outage, is felt.

Figure 4 represents the behavior of the Network Traffic metrics (node_network_receive_bytes_total, node_network_transmit_bytes_total) from the deployment in which a defect was introduced on 17/09 until its resolution on 13/10. This defect was not noticed until the business started to complain about performance issues almost 3 weeks later. The defect was adding inbound traffic to the system and slowly consuming all the system resources, until the business felt the impact and, later, an outage occurred, which forced the infra teams to add more resources to the VM until the issue was fixed.

Figure 4: Representation of Network Traffic node exporter metrics in Grafana

The System Load was also affected by the defect. Figure 5 clearly shows a slow increase in System Load (node_loadN) from the deployment until the hotfix, which was not properly monitored and, in the end, led to business impact.

Figure 5: Representation of System Load node exporter metric in Grafana

After a proper debriefing of what could have been done to prevent or mitigate the business impact, it was clearly visible that, from the introduction of the defect until the business felt the impact, the Network Traffic and System Load metrics were unstable and slowly increasing (represented in red on both images).

This outage led us to define which of the metrics exposed by node exporter are the most important for this specific system, find their patterns, monitor them, and finally implement proper real-time alerts in case instability is detected in a metric's value.

As a first step for monitoring and alerting, we defined two simple metrics as the ones we consider the most valuable for this specific case.
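
The idea that a metric slowly drifting away from its usual pattern signals declining system health can be illustrated with a small sketch. It is not the alerting we implemented; it only shows one possible way to flag a slow upward trend in a series of samples such as System Load. The function names, the sample values, and the threshold of 0.05 load units per day are all hypothetical and chosen for illustration only.

```python
from statistics import mean

SECONDS_PER_DAY = 86400


def slope_per_day(samples):
    """Least-squares slope of (timestamp_seconds, value) samples, in units per day."""
    timestamps = [t for t, _ in samples]
    values = [v for _, v in samples]
    t_mean = mean(timestamps)
    v_mean = mean(values)
    denominator = sum((t - t_mean) ** 2 for t in timestamps)
    if denominator == 0:
        # Fewer than two distinct timestamps: no trend can be estimated.
        return 0.0
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in samples)
    return (numerator / denominator) * SECONDS_PER_DAY


def is_drifting(samples, max_slope_per_day):
    """True when the metric grows faster than the allowed slope (hypothetical rule)."""
    return slope_per_day(samples) > max_slope_per_day


# Made-up data: one load sample per day, rising by 0.1 per day for ten days.
rising_load = [(day * SECONDS_PER_DAY, 1.0 + 0.1 * day) for day in range(10)]
# Made-up data: a stable load over the same period.
stable_load = [(day * SECONDS_PER_DAY, 1.0) for day in range(10)]

if __name__ == "__main__":
    print("rising load drifting:", is_drifting(rising_load, 0.05))
    print("stable load drifting:", is_drifting(stable_load, 0.05))
```

A trend check of this kind targets the slow, steady increase described for the incident, which a fixed threshold on the current value alone would only catch late. For counters such as node_network_receive_bytes_total, the samples would first need to be converted into a rate, since the raw counter value always increases.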