Monitoring a container environment is important for a number of reasons:
- Resource Utilisation: Containers run isolated processes and have their own set of resources such as CPU, memory, and storage. Monitoring these resources can help identify and troubleshoot issues related to overutilisation, underutilisation, and resource contention.
- Performance: Monitoring the performance of your containers and the underlying infrastructure can help you identify and fix bottlenecks, such as slow network or disk I/O. This can help improve the overall performance of your containerised applications.
- Availability: Monitoring the availability of your containers and services can help you quickly identify and resolve issues that may impact the availability of your applications.
- Scalability: Monitoring the performance and resource utilisation of your container environment can help you identify when it’s time to scale up or down to meet the changing needs of your applications.
- Compliance and Security: Container environments are often used to host sensitive data and applications, so it’s important to ensure that they are configured securely and in compliance with industry regulations. Monitoring can help you identify and resolve any issues related to security and compliance.
- Cost optimisation: Monitoring your container environment can also help you optimise costs by identifying and stopping underutilised resources.
Essentially, monitoring your container environment is important to ensure the reliability, performance, and security of your applications, as well as to identify and resolve any issues before they impact your users.
This is where Prometheus and Grafana come in…
Prometheus is a popular open-source monitoring and alerting system. It is primarily used to monitor the performance of various services and systems in a distributed environment.
Prometheus uses a pull-based model, where the Prometheus server periodically scrapes metrics from the monitored services and stores them in a time-series database.
This allows for easy querying and alerting based on the collected metrics.
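To make the pull model a little more concrete, here is a minimal sketch of what a scrape target might serve. It uses only the Python standard library and renders gauges in the Prometheus text exposition format; the metric name, the label, and port 8000 are assumptions made up for illustration, and a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample data: metric name -> list of (labels, value) pairs.
EXAMPLE_SAMPLES = {
    "demo_temperature_celsius": [({"room": "lab"}, 21.5)],
}


def render_metrics(samples):
    """Render gauge samples in the Prometheus text exposition format."""
    lines = []
    for name, series in samples.items():
        lines.append(f"# TYPE {name} gauge")
        for labels, value in series:
            if labels:
                pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
                lines.append(f"{name}{{{pairs}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on /metrics so a scraper can pull them."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(EXAMPLE_SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at that port would let the server pull `demo_temperature_celsius` on each scrape interval.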
One of the key features of Prometheus is its query language, PromQL, which lets you filter, aggregate, and analyse the collected metrics. This can be used to build custom dashboards and alerts for specific use cases.
Prometheus also supports alerting rules, which trigger notifications (typically routed through its Alertmanager component) when the collected metrics meet specific conditions. This can be used to alert on things like high CPU usage or slow response times.
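As a rough illustration of how a condition-based alert behaves, the sketch below checks whether a metric has stayed above a threshold for a minimum duration, which is the idea behind the `for` clause of a Prometheus alerting rule. It is a simplified model written for this post, not Prometheus's own implementation, and the function name and sample data are hypothetical.

```python
def alert_is_firing(samples, threshold, for_seconds):
    """Return True if value > threshold has held for at least `for_seconds`.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            # Remember when the current unbroken breach began.
            if breach_start is None:
                breach_start = ts
        else:
            # Any sample at or below the threshold resets the breach.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# Hypothetical CPU usage ratio sampled once a minute.
cpu_usage = [(0, 0.50), (60, 0.95), (120, 0.97), (180, 0.99)]
high_cpu = alert_is_firing(cpu_usage, threshold=0.9, for_seconds=120)
```

Requiring the condition to hold for a while, rather than firing on a single sample, helps avoid alerts on short spikes.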
Prometheus is highly extensible and has a large ecosystem of exporters and integrations that can be used to monitor a wide variety of systems and services, including Kubernetes and Docker, amongst others.
In essence, Prometheus is a powerful and flexible monitoring and alerting tool that is well suited for use in distributed systems and cloud environments.
So what about visualisation and dashboards…
Prometheus dashboards are typically created using Grafana, a popular open-source visualization and dashboard tool.
Grafana allows you to create and share dashboards and alerts. It is often used in conjunction with data stores such as Prometheus, InfluxDB, and Elasticsearch to display and analyse metrics and log data.
Grafana provides a web-based interface that allows you to create custom dashboards with a variety of visualisations, including line charts, bar charts, and heat maps.
You can also create alerts that notify you when specific conditions are met, such as when a metric exceeds a certain threshold.
Grafana supports a wide range of data sources, including Prometheus, InfluxDB, Elasticsearch, Graphite, amongst others. This allows you to easily connect to and visualize data from multiple sources in a single dashboard.
Grafana also includes a query editor, which allows you to write complex queries using PromQL (the Prometheus query language mentioned above) and other query languages to extract data from your data sources.
It’s also worth noting that Grafana has a huge community and is widely used in industry. It has a wide range of plugins, dashboards and alerting options that can be easily integrated into different environments.
So, show me how to get this spun-up…
Below is an example of how to create a simple dashboard in Grafana to display some basic metrics from a Prometheus server:
- First, install and configure Grafana to use the Prometheus data source.
- Create a new dashboard by clicking on the “New Dashboard” button in the Grafana menu.
- Add a new panel to the dashboard by clicking on the “Add Panel” button.
- In the “Metrics” tab, select the Prometheus data source and use PromQL to query the desired metric. For example, you could write a query that displays the average CPU usage across all nodes in a Kubernetes cluster.
- You can add multiple panels with different metrics and customise how the data is visualised. You can also create alerting rules that define the conditions under which alerts are triggered.
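For the average CPU usage panel described in the steps above, one commonly used query looks like the string in the sketch below. It assumes node_exporter is running on every node so that the `node_cpu_seconds_total` counter exists; that assumption, the 5-minute window, and the `localhost:9090` address are mine, not part of the original walkthrough. The helper also shows how the same query could be sent to the Prometheus HTTP API instead of being pasted into a Grafana panel.

```python
from urllib.parse import urlencode

# Assumption: node_exporter exposes node_cpu_seconds_total on every node.
# The query turns the average idle rate into a "percent busy" figure.
AVG_CPU_USAGE_QUERY = (
    '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
)


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus instant-query endpoint, /api/v1/query."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


# Hypothetical server address; adjust to wherever Prometheus is listening.
url = instant_query_url("http://localhost:9090", AVG_CPU_USAGE_QUERY)
```

In Grafana you would paste only the query string into the panel's query field; the URL form is handy for quick checks with a browser or a script.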
Another example is a PromQL query that returns the number of HTTP requests received by a service exposed on port 8080.
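A sketch of how such a query might look is shown below. It assumes the service exports a counter named `http_requests_total` and that its scrape target address ends in `:8080`; both are conventions rather than guarantees, so check what your service actually exposes. The helper then pulls the first value out of the JSON that the Prometheus instant-query API returns.

```python
import json

# Assumptions: the counter is called http_requests_total and the target's
# instance label ends in :8080. Wrap the selector in rate(...[5m]) instead
# if you want requests per second rather than a running total.
HTTP_REQUESTS_QUERY = 'sum(http_requests_total{instance=~".*:8080"})'


def first_value(response_text):
    """Extract the first sample value from an instant-query JSON response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    result = payload["data"]["result"]
    if not result:
        return None
    # Each vector sample is [unix_timestamp, "value-as-string"].
    _, value = result[0]["value"]
    return float(value)


# Hypothetical response body, shaped like an instant-query result.
sample_response = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000.0,"1250"]}]}}'
)
total_requests = first_value(sample_response)
```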
You can also use built-in Prometheus metrics to monitor the health and performance of the Prometheus server itself, such as the number of scrapes, the number of samples ingested, and the memory usage of the server.
- scrape_samples_scraped
- scrape_duration_seconds
- prometheus_local_storage_memory_chunks (a Prometheus 1.x metric; on 2.x, prometheus_tsdb_head_chunks is the closest equivalent)
It’s worth noting that Prometheus and Grafana are highly flexible and can be used to monitor and visualize a wide variety of metrics and systems. The examples above should give you a good starting point for creating your own dashboards and metrics, but you should explore other options as well.
So, what if I have, for some reason, multiple Prometheus instances and wish to query across them?
This is where Thanos comes in…
Thanos is an open-source project that provides a set of components for extending the functionality of Prometheus. It is designed to be used in large-scale, highly-available Prometheus deployments, where the storage and querying of metrics can become a bottleneck.
The main feature of Thanos is its ability to provide a global query view across multiple Prometheus instances. This allows you to aggregate metrics from multiple Prometheus servers into a single, unified view, making it easier to monitor and troubleshoot large, distributed systems.
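Conceptually, a global view means merging the series returned by each Prometheus instance and collapsing duplicates that come from replicas scraping the same targets. The sketch below illustrates that idea only; the data layout, the function name, and the `replica` label name are assumptions for this example, and the deduplication Thanos actually performs is more sophisticated.

```python
def merge_series(instances, replica_label="replica"):
    """Merge series from several Prometheus instances into one view.

    `instances` is a list of series lists. Each series is a dict with
    "labels" (a dict) and "samples" (a list of (timestamp, value) pairs).
    Series that differ only in `replica_label` are treated as duplicates.
    """
    merged = {}
    for series_list in instances:
        for series in series_list:
            # Identify a series by its labels, ignoring the replica label.
            key = tuple(sorted(
                (k, v) for k, v in series["labels"].items() if k != replica_label
            ))
            by_timestamp = merged.setdefault(key, {})
            for ts, value in series["samples"]:
                # Keep the first value seen for each timestamp.
                by_timestamp.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Two hypothetical replicas with overlapping samples for the same job.
replica_a = [{"labels": {"job": "api", "replica": "a"}, "samples": [(1, 10.0), (2, 11.0)]}]
replica_b = [{"labels": {"job": "api", "replica": "b"}, "samples": [(2, 11.0), (3, 12.0)]}]
global_view = merge_series([replica_a, replica_b])
```

Here the two replicas end up as a single `job="api"` series covering timestamps 1 to 3, which is the kind of unified result a dashboard would query.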
Thanos also provides a number of other features, including:
- High availability: Thanos can automatically replicate and distribute data across multiple instances, providing a highly-available and fault-tolerant storage solution.
- Data retention: Thanos allows you to configure how long data should be retained, and automatically prunes old data to save disk space.
- Downsampling: Thanos can automatically downsample data to reduce the amount of disk space needed to store it, while still preserving the ability to query and analyse it.
- Long-term storage: Thanos supports storing data in object storage such as S3, Google Cloud Storage, and Azure Storage Accounts (Blob), allowing you to keep data for longer periods of time.
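To give a feel for what downsampling means in practice, here is a small sketch that collapses raw samples into fixed windows and keeps a few aggregates per window, so that averages, minimums, and maximums can still be answered from far fewer points. The 5-minute window and the choice of aggregates are assumptions for illustration, not a description of Thanos internals.

```python
def downsample(samples, window_seconds=300):
    """Aggregate raw (timestamp, value) samples into fixed-size windows.

    Returns one dict per window, ordered by window start, holding the
    count, sum, min, and max of the samples that fell into it.
    """
    buckets = {}
    for ts, value in samples:
        start = int(ts // window_seconds) * window_seconds
        bucket = buckets.get(start)
        if bucket is None:
            buckets[start] = {
                "start": start, "count": 1, "sum": value,
                "min": value, "max": value,
            }
        else:
            bucket["count"] += 1
            bucket["sum"] += value
            bucket["min"] = min(bucket["min"], value)
            bucket["max"] = max(bucket["max"], value)
    return [buckets[start] for start in sorted(buckets)]


# Hypothetical raw samples: three in the first window, one in the second.
raw = [(0, 1.0), (60, 3.0), (299, 2.0), (300, 10.0)]
coarse = downsample(raw)
```

The average for a window can be recovered as `sum / count`, so four raw points shrink to two stored rows without losing the ability to chart typical, lowest, and highest values.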
Thanos is designed to be used in conjunction with Prometheus, and is typically deployed alongside a Prometheus server. It can also be integrated with other monitoring and visualization tools, such as Grafana, to provide a unified view of metrics across multiple systems.
How to configure Thanos to work with a distributed Prometheus environment, and what the architecture looks like, is a topic for another day…
“The Grafana Labs Marks are trademarks of Grafana Labs, and are used with Grafana Labs’ permission. We are not affiliated with, endorsed or sponsored by Grafana Labs or its affiliates.”
“Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.”