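Grafana is a data visualization platform that can connect to many data sources, Prometheus among them. Prometheus combines monitoring, alerting, and a time-series database: it actively opens TCP connections to the Exporter port on each monitored target, scrapes the metrics over HTTP, and stores them in its own time-series database. Grafana then queries Prometheus in real time and displays the results.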
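Monitoring systems collect data in one of two styles, Pull and Push:
- Pull: the monitoring system actively fetches the data. It needs a service discovery mechanism, and every monitored target must expose a port that the monitoring system can reach. In general the targets are simple to configure and the monitoring system is complex to configure. Advantages: an offline target is easy to detect, and the monitoring system controls the order and frequency of scrapes, which keeps it from being overloaded. Disadvantage: every target has to open a port, which complicates firewall and other security configuration. Hassle: the monitoring system has to maintain information about every target.
- Push: the monitored targets actively send their data and the monitoring system receives it passively. Only the monitoring system needs an open port; the targets are slightly more complex to configure because each one needs a push destination. Advantages: the monitoring system does not maintain a target list, it is the only party that opens a port, and a target only needs a network path to the monitoring system, even from behind several firewalls and multiple layers of NAT. Disadvantages: the monitoring system may be overloaded, and an offline target is hard to detect. Hassle: every target has to be configured with the monitoring system's address.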
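The pull model can be sketched in a few lines of shell. This is only an illustration of the idea, not how Prometheus is implemented: `fetch_metrics` and `scrape_targets` are hypothetical names, and the 5-second timeout is an arbitrary choice. The monitoring side walks its own target list, and a target that fails to answer is immediately known to be down.

```shell
# Hypothetical helper: fetch the metrics page of one target (host:port).
# Assumes the target serves plain-text metrics at /metrics over HTTP.
fetch_metrics() {
  curl -fsS --max-time 5 "http://$1/metrics"
}

# Pull loop: the monitoring side decides which targets to visit and when.
# A failed fetch doubles as the "target is offline" signal.
scrape_targets() {
  for target in "$@"; do
    if fetch_metrics "$target" > /dev/null 2>&1; then
      echo "$target up"
    else
      echo "$target down"
    fi
  done
}
```

Calling `scrape_targets 192.168.1.101:9100 192.168.1.102:9100` prints one `up` or `down` line per target. In a push setup there is no such loop, which is why a silent target is harder to notice.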
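Monitored targets (clients)
Node Exporter is the official Prometheus exporter for monitoring physical hosts. Change the version number in the commands below to the latest release.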
wget -q https://mirror.nju.edu.cn/github-release/prometheus/node_exporter/LatestRelease/node_exporter-1.9.0.linux-amd64.tar.gz
tar xf node_exporter-*
sudo mv node_exporter-*/node_exporter /usr/local/sbin/
sudo chown root:root /usr/local/sbin/node_exporter
rm -rf ./node_exporter-*
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
Restart=always
ExecStart=/usr/local/sbin/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now node_exporter.service
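Once the service is running, Node Exporter should answer on port 9100 (the port used for it later in prometheus.yml). A quick way to confirm this is to fetch the metrics page and pick out one value. The helper below is a minimal sketch: `metric_value` is a hypothetical name, and it only handles samples without labels, such as `node_load1`.

```shell
# Print the value of the first unlabelled sample named $1 from
# Prometheus text-format metrics read on stdin.
# The whole input is consumed so the producer never sees a closed pipe.
metric_value() {
  awk -v name="$1" '$1 == name && !found { print $2; found = 1 }'
}

# Usage on the monitored host:
#   curl -fsS http://localhost:9100/metrics | metric_value node_load1
```

If the command prints a number, the exporter is up and serving metrics; if curl reports a connection error, check `systemctl status node_exporter.service` first.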
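DCGM-Exporter is NVIDIA's official GPU monitoring exporter; it is built on DCGM.
Install NVIDIA Data Center GPU Manager (DCGM). Adjust the OS version, architecture, and keyring package version to match your system.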
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
rm -f cuda-keyring_1.1-1_all.deb
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
sudo apt-get update && sudo apt-get install datacenter-gpu-manager-4-cuda-all
sudo systemctl enable --now nvidia-dcgm
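Install Go, then build DCGM-Exporter. Change the Go version as needed. The remaining commands in this part write to /usr/local and /etc/systemd/system, so run them as root.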
wget https://mirror.nju.edu.cn/golang/go1.24.1.linux-amd64.tar.gz
tar -C /usr/local -xzf go*.tar.gz
rm -f go*.tar.gz
export PATH=$PATH:/usr/local/go/bin
go env -w GO111MODULE=on
go env -w GOPROXY="https://repo.nju.edu.cn/go/,direct"
git clone https://github.com/NVIDIA/dcgm-exporter
cd dcgm-exporter
make binary
cp cmd/dcgm-exporter/dcgm-exporter /usr/local/sbin/
mkdir /usr/local/etc/dcgm-exporter
cp etc/* /usr/local/etc/dcgm-exporter/
cd ..
rm -rf dcgm-exporter
cat > /etc/systemd/system/dcgm-exporter.service <<EOF
[Unit]
Description=Prometheus DCGM exporter
Wants=network-online.target nvidia-dcgm.service
After=network-online.target nvidia-dcgm.service
[Service]
Type=simple
Restart=always
ExecStart=/usr/local/sbin/dcgm-exporter --collectors /usr/local/etc/dcgm-exporter/default-counters.csv
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now dcgm-exporter.service
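Note that the DCGM and DCGM-Exporter versions must match. A DCGM-Exporter version has the form <DCGM version>-<Exporter version>. In the output below, 4.1.1-4.0.4 means that DCGM-Exporter itself is version 4.0.4 and requires DCGM 4.1.1; the check shows that the two match.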
/usr/local/sbin/dcgm-exporter --version
2025/03/23 10:46:13 maxprocs: Leaving GOMAXPROCS=112: CPU quota undefined
DCGM Exporter version 4.1.1-4.0.4
dcgmi --version
dcgmi version: 4.1.1
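The same comparison can be scripted. The sketch below assumes the two output formats shown above (`DCGM Exporter version X-Y` and `dcgmi version: X`); the function names are made up for illustration.

```shell
# Extract the last field of the line that carries the version string.
# Other lines (such as the maxprocs log line) are ignored.
exporter_version() { awk '/DCGM Exporter version/ { print $NF }'; }
dcgmi_version()    { awk '/dcgmi version:/ { print $NF }'; }

# Succeed when the DCGM part of the exporter version (the text before
# the first "-") equals the installed DCGM version.
dcgm_versions_match() {
  [ "${1%%-*}" = "$2" ]
}

# Usage:
#   e=$(/usr/local/sbin/dcgm-exporter --version 2>&1 | exporter_version)
#   d=$(dcgmi --version | dcgmi_version)
#   dcgm_versions_match "$e" "$d" && echo "versions match"
```

With the output shown above, `e` is `4.1.1-4.0.4` and `d` is `4.1.1`, so the check succeeds.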
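Monitoring system (server)
Install Docker and Docker Compose V2 first, then deploy Grafana and Prometheus as containers. To keep things simple, this setup leaves out a reverse proxy in front, authentication for Prometheus, and Consul-based automatic service discovery and registration.
Create the directories first and change their owner and group; otherwise the containers fail at startup with permission errors.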
mkdir grafana
mkdir prometheus
mkdir prometheus-conf
sudo chown 472:0 grafana/
sudo chown 65534:65534 prometheus
Edit docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    restart: always
    volumes:
      - ./prometheus:/prometheus
      - ./prometheus-conf:/etc/prometheus/conf
    command:
      - --config.file=/etc/prometheus/conf/prometheus.yml
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --web.listen-address=0.0.0.0:9090
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.size=100GB # storage size limit; once exceeded, the oldest data is deleted automatically
      - --storage.tsdb.wal-compression
  grafana:
    image: grafana/grafana
    container_name: grafana
    restart: always
    ports:
      - 3000:3000
    volumes:
      - ./grafana:/var/lib/grafana
    environment:
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
      - GF_SECURITY_ADMIN_PASSWORD=yaoge123 # initial password for admin
      - GF_SERVER_ENABLE_GZIP=true
      - GF_SERVER_DOMAIN=192.168.1.10 # change this if there is a reverse proxy in front
      - GF_SERVER_ROOT_URL=http://192.168.1.10 # change this if there is a reverse proxy in front
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_NAME=yaoge123
      #- GF_SERVER_SERVE_FROM_SUB_PATH=true # served under a sub-path behind the reverse proxy
      #- GF_SERVER_ROOT_URL=%(protocol)s://%(domain)s:%(http_port)s/grafana # served under a sub-path behind the reverse proxy
      #- GF_SECURITY_COOKIE_SECURE=true # the reverse proxy terminates HTTPS
    depends_on:
      - prometheus
Fetch the default prometheus.yml configuration
cd prometheus-conf
wget https://raw.githubusercontent.com/prometheus/prometheus/refs/heads/main/documentation/examples/prometheus.yml
The main thing to edit in prometheus.yml is the list of scrape targets
# my global config
global:
  scrape_interval: 30s # Scrape targets every 30 seconds. Default is every 1 minute.
  evaluation_interval: 30s # Evaluate rules every 30 seconds. The default is every 1 minute.
  scrape_timeout: 30s # Scrape timeout (the global default is 10s); some exporters respond slowly, so it is relaxed here.
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
        # The label name is added as a label `label_name=<label_value>` to any timeseries scraped from this config.
        labels:
          app: "prometheus"
  - job_name: node_exporter
    static_configs:
      - targets:
          - '192.168.1.101:9100'
          - '192.168.1.102:9100'
  - job_name: dcgm-exporter
    static_configs:
      - targets:
          - '192.168.1.101:9400'
          - '192.168.1.102:9400'
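With many hosts, the target lists become tedious to type by hand. As a rough sketch, a small shell function can print a scrape job shaped like the node_exporter and dcgm-exporter jobs above, using standard YAML indentation. `print_scrape_job` is a hypothetical helper, and its output still has to be pasted under `scrape_configs:` manually.

```shell
# Print one static scrape job for prometheus.yml.
# Arguments: job name, exporter port, then one or more hosts.
print_scrape_job() {
  job=$1
  port=$2
  shift 2
  printf '  - job_name: %s\n' "$job"
  printf '    static_configs:\n'
  printf '      - targets:\n'
  for host in "$@"; do
    printf "          - '%s:%s'\n" "$host" "$port"
  done
}

# Usage:
#   print_scrape_job node_exporter 9100 192.168.1.101 192.168.1.102
#   print_scrape_job dcgm-exporter 9400 192.168.1.101 192.168.1.102
```

This only saves typing for a static list; it does not replace real service discovery.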
Start the containers:
cd .. && docker compose up -d
Next, configure Grafana
- Log in to Grafana as admin with the initial password (GF_SECURITY_ADMIN_PASSWORD)
- Go to /connections/datasources/prometheus and add the existing Prometheus as a data source: Connection: http://prometheus:9090
- Go to /dashboard/import and import Node Exporter Full: Grafana Dashboard ID 1860
- Download https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/refs/heads/main/grafana/dcgm-exporter-dashboard.json , then go to /dashboard/import and import that JSON
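That completes a simple monitoring platform. In a real production environment, managing monitored targets with Consul is essential, Prometheus should have authentication added, and Grafana is best placed behind an NGINX reverse proxy that provides HTTPS.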