Event monitoring and notification subsystem
1. Overview
The Event monitoring and notification subsystem collects monitoring data from applications and infrastructure and stores it as time series. It provides tools for building search queries over this data and for creating alerts based on those queries.
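As a rough illustration of alerting on a query over time-series data, the sketch below models a series as timestamped samples and fires an alert when the average over a recent window exceeds a threshold. The names `Sample`, `avg_over_window`, and `should_alert`, as well as the sample values and threshold, are hypothetical and are not part of the subsystem; in the subsystem itself this role belongs to Prometheus queries and alert rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds, e.g. since the epoch
    value: float


def avg_over_window(samples, now, window_seconds):
    """Average of values with now - window_seconds < timestamp <= now, or None if there are none."""
    recent = [s.value for s in samples if now - window_seconds < s.timestamp <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)


def should_alert(samples, now, window_seconds, threshold):
    """True when the windowed average exists and is strictly above the threshold."""
    avg = avg_over_window(samples, now, window_seconds)
    return avg is not None and avg > threshold


# Hypothetical CPU usage series (fraction of one core), one sample per minute.
cpu_usage = [Sample(0, 0.20), Sample(60, 0.95), Sample(120, 0.90), Sample(180, 0.97)]
firing = should_alert(cpu_usage, now=180, window_seconds=120, threshold=0.9)
```

With these values, the window covers the samples at 120 and 180 seconds, so `firing` is `True`.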
2. Subsystem functions
- Collection and storage of monitoring data from Platform and Registry components.
- Generation of alerts based on monitoring data.
- Visualization based on monitoring data.
- Provision of a unified interface for metrics collection.
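To make the idea of a unified metrics interface more concrete, the sketch below renders metrics in the Prometheus text exposition format, which Prometheus scrapes over HTTP. This assumes that the unified interface is that format; the source does not say so explicitly. The function names and the example metric are hypothetical.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline, as the text format requires for label values."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family; `samples` is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between runs.
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter exposed by an application.
text = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "code": "200"}, 1027)],
)
```

An application would typically serve such text from an HTTP endpoint that the monitoring service scrapes at a regular interval.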
3. Technical design
The following diagram depicts the components of the Event monitoring and notification subsystem and their interaction with other subsystems.
The Event monitoring and notification subsystem can be expanded with custom alert receivers (Alertmanager Receivers) and the creation of custom dashboards in addition to the default ones.
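A custom alert receiver could, for example, be a small HTTP service that Alertmanager calls through its webhook integration. The sketch below is a minimal illustration under that assumption: it turns the `alerts` array of a webhook payload into readable messages. The handler class, the `serve` helper, the port, and the message layout are hypothetical, and delivery to an actual channel is left as a placeholder.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def format_alerts(payload):
    """Turn an Alertmanager webhook payload (a dict) into one text message per alert."""
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unknown")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        messages.append(f"[{status}] {name} (severity={severity}): {summary}")
    return messages


class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for message in format_alerts(payload):
            print(message)  # placeholder: forward to the custom notification channel here
        self.send_response(200)
        self.end_headers()


def serve(port=9099):
    """Start the receiver; the port is an arbitrary choice for this sketch."""
    HTTPServer(("0.0.0.0", port), AlertWebhookHandler).serve_forever()
```

Calling `serve()` starts the receiver; Alertmanager would then need a receiver entry with a webhook URL pointing at this service.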
4. Subsystem components
Component Name | Namespace | Deployment | Origin | Repository | Purpose |
---|---|---|---|---|---|
Platform Monitoring Web Interface | | | 3rd-party | | Visualization and access to monitoring data |
Image Rendering Extension | | | 3rd-party | | Grafana extension for generating PNG images during chart and graph export |
Alert Service | | | 3rd-party | | Notification of administrators about anomalies or issues with Platform or Registry components |
Prometheus Operator | | | 3rd-party | | Monitoring component providing configuration, deployment, and maintenance of Prometheus monitoring subsystem components |
Monitoring Subsystem Operator | | | 3rd-party | | Configuration, deployment, and maintenance of the Platform’s monitoring subsystem in OpenShift |
Prometheus Query Service | | | 3rd-party | | Aggregates and deduplicates OpenShift container orchestration platform metrics and Registry metrics behind a multi-user interface |
Virtual Machine Metrics Exporters | | | 3rd-party | | Collection of metrics from the Platform and Registry virtual machines |
Monitoring Service | | | 3rd-party | | Collection and storage of Platform and Registry component metrics. Central component on which the Event monitoring and notification subsystem is based. Prometheus is a time-series database and metric rule engine. It also sends alerts to Alertmanager for processing. |
k8s Object Monitoring Service | | | 3rd-party | | Collects metrics related to the state of resources and objects of the Kubernetes API server in the container orchestration platform |
OpenShift Object Monitoring Service | | | 3rd-party | | Collects metrics related to the state of resources and objects of the OpenShift API server in the container orchestration platform |
OKD Cluster Scaling Metric Integration Component | | | 3rd-party | | Metrics transfer between Prometheus and the container orchestration platform’s auto-scaling components |
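The deduplication performed by the Prometheus Query Service can be pictured with the sketch below. It assumes that duplicate series come from several sources that differ only in a replica label, and it merges them by dropping that label and keeping one value per timestamp. The label name `replica`, the function name, and the data are hypothetical; the real service may use a different strategy.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that are identical except for `replica_label`.

    `series_list` holds (labels_dict, samples) pairs, where samples is a list of
    (timestamp, value) tuples. Returns a dict keyed by the remaining labels.
    """
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        points = merged.setdefault(key, {})
        for timestamp, value in samples:
            # The first source to report a timestamp wins; later duplicates are ignored.
            points.setdefault(timestamp, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Two hypothetical replicas reporting the same metric with overlapping samples.
series = [
    ({"job": "api", "replica": "a"}, [(1, 10.0), (2, 11.0)]),
    ({"job": "api", "replica": "b"}, [(2, 11.5), (3, 12.0)]),
]
result = deduplicate(series)
```

Here both inputs collapse into a single series for `job="api"` with samples at timestamps 1, 2, and 3.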
5. Grafana dashboards
The dashboards listed below are installed together with the Registries Platform.
This default set allows Platform and registry administrators to track component performance over time and identify potential problems before they become critical.
Dashboard | Technical name | Owner subsystem | Purpose |
---|---|---|---|
Camunda dashboard | camunda-metrics | | Provides visibility into general metrics of business process execution and user tasks (message exchange, deletion of historical data). |
Ceph dashboard | ceph | | Allows analysis of general state metrics for Ceph and its components (current status, volumes of free and occupied storage, and performance metrics). |
etcd dashboard | etcd | | Allows viewing general etcd storage metrics of the OKD container orchestration platform (Raft leader election statistics, current status, and storage size). |
OpenShift Cluster Metrics dashboard | cluster-total | | General resource usage metrics for the OKD container orchestration platform. It provides detailed metrics about the CPU, RAM, network, and disk load of the OpenShift cluster. |
Java Management Extensions dashboard | jmx | | Displays metrics related to Java applications running in the container orchestration platform. |
Spring Boot dashboard | springboot | | In addition to the JMX panel, it shows Spring Boot metrics, namely the number, response time, and errors of HTTP requests, cache usage, and other metrics useful for analyzing the operation of Spring Boot applications. |
General Kubernetes dashboard | k8s-resources-cluster | | Allows analysis of OpenShift cluster state metrics and resource usage at the cluster level. |
Kubernetes Namespace Level dashboard | k8s-resources-namespace | | Allows analysis of general resource usage metrics at the namespace level. |
Kubernetes Virtual Machine Level dashboard | k8s-resources-node | | Allows analysis of general resource usage metrics at the level of an individual virtual machine. |
Kubernetes Pod Level dashboard | k8s-resources-pod | | Allows analysis of general resource usage metrics at the level of individual pods. |
Kubernetes Deployment Type dashboard | k8s-resources-workload | | Allows analysis of general resource usage metrics with the possibility of filtering by specific deployment types in Kubernetes (deployments, jobs, daemonsets, statefulsets). |
Kubernetes Deployment Types Dashboard | k8s-resources-workload-namespace | | Allows analysis of general resource usage metrics at the level of individual deployment types in Kubernetes, presented at the namespace level. |
Kubernetes Pod Level Dashboard | namespace-by-pod | | Provides a comprehensive overview of pod resource usage metrics at the namespace level. |
Kubernetes Disk Level Dashboard | volume-load | | Allows analysis of general disk storage usage metrics at the cluster and virtual machine levels. |
Kubernetes Cluster Level Dashboard | node-cluster-rsrc-use | | Displays general metrics of the entire cluster. |
Kubernetes Cluster Level Dashboard | node-rsrc-use | | Displays general metrics of the entire cluster with the possibility of filtering by individual virtual machines. |
Network Dashboard | pod-total | | Displays metrics of current traffic between pods in individual namespaces. |
Asynchronous Message Exchange Subsystem Dashboard | kafka-data | | Displays metrics related to the operation of Kafka brokers and consumers in the cluster. |
Strimzi Asynchronous Message Exchange Subsystem Dashboard | strimzi-kafka | | Displays metrics related to the operation of Kafka brokers and consumers in the cluster. |
User and Role Management Subsystem Dashboard | keycloak-metrics | | Displays Keycloak metrics broken down by realm, with the possibility of filtering by Keycloak instance. |
User and Role Management Subsystem Dashboard | keycloak-x-microprofile-metrics | | Displays metrics related to the operation of the Java metrics component of Keycloak. |
PostgreSQL Database Dashboard | postgresql-db | | Provides detailed information about the PostgreSQL database instance. |
PostgreSQL Queries Dashboard | postgresql-queries | | Provides additional information about queries. |
Public API Dashboard for External Traffic Management Subsystem of the Operational Zone Registry | kong-public-api | External Traffic Management Subsystem of the Operational Zone Registry | Allows viewing requests for each public search condition and their quantity, request execution trend, and performance statistics. |
Monitoring, Events, and Notifications Subsystem Dashboard | prometheus | | Allows monitoring of the status and performance of monitoring subsystem components. |
Registry Analytical Reporting Subsystem Dashboard | redash | | Provides statistics on queries in the Redash component. |
Non-Relational Database Management Subsystem Dashboard | redis | | Provides information about a specific Redis cluster. |
PostgreSQL Backup Dashboard | crunchy-pgbackrest | | Provides information about the general status of pgBackRest backups. |
Detailed PostgreSQL Pod Dashboard | crunchy-pod-details | | Provides information about resource usage by specific pods used by the PostgreSQL cluster. |
Detailed PostgreSQL Dashboard | crunchy-postgresql-details | | Provides more information about a specific PostgreSQL cluster. It includes many critical PostgreSQL-specific metrics. |
Overview PostgreSQL Dashboard | crunchy-postgresql-overview | | Provides an overview of all PostgreSQL clusters deployed on the Platform. |
PostgreSQL Service Dashboard | crunchy-postgresql-service-health | | Contains information about Kubernetes services located in front of PostgreSQL pods. It provides information about network status. |
PostgreSQL Queries Dashboard | crunchy-query-statistics | | Provides information about the overall query performance. |
6. Technology stack
The following technologies were used during the design and development of the subsystem:
7. Subsystem quality attributes
7.1. Scalability
The Event monitoring and notification subsystem is designed for horizontal scaling so that it can support large clusters and high volumes of metrics from the Platform and registries.
7.2. Reliability
The Event monitoring and notification subsystem is built on stable, widely used components (Prometheus, Grafana, and Alertmanager) to provide accurate and consistent monitoring of the Platform and registries and analysis of the collected metrics.
7.3. Extensibility
The Event monitoring and notification subsystem provides extension points for custom dashboards and for custom notification channels that aren’t supported by default (for example, Telegram).