Grafana + Prometheus for Monitoring Visualization

Grafana is a data-visualization platform that can connect to many data sources, Prometheus among them. Prometheus combines monitoring/alerting with a time-series database: it actively opens HTTP connections to the exporter port on each monitored target, scrapes the metrics, and stores them in its own time-series database; Grafana then queries Prometheus in real time and renders the data.

Monitoring data can be acquired in two major ways, Pull and Push:

  • Pull: the monitoring system actively fetches the data. It needs a service-discovery mechanism, and every monitored target must expose a port the monitoring system can reach; in general the targets are simple to configure while the monitoring system's configuration is complex. Upsides: offline targets are easy to detect, and the monitoring system controls the scrape order and frequency, protecting itself from overload. Downside: every target has to open a port, which complicates firewalls and other security configuration. The chore: the monitoring system must maintain information about every target.
  • Push: the monitored targets actively push their data and the monitoring system receives it passively. Only the monitoring system needs to open a port, and the targets are slightly more complex to configure (each must be told the push destination). Upsides: the monitoring system does not need to maintain a target list, only the monitoring system opens a port, and a target merely needs network reachability to the monitoring system, even from behind several firewalls and layers of NAT. Downsides: the monitoring system can be overloaded, and offline targets are harder to notice. The chore: every target must be configured with the monitoring system's address.
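
To make the two models concrete in Prometheus terms (a sketch, separate from the setup below; the Pushgateway host is hypothetical):

# Pull: the monitoring server simply issues an HTTP GET against the target's exporter
curl -s http://192.168.1.101:9100/metrics | head

# Push: the target posts its metrics to a Pushgateway, which Prometheus then scrapes like any other target
echo "demo_metric 3.14" | curl -s --data-binary @- http://pushgateway.example.org:9091/metrics/job/demo_job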

Monitored Targets (Clients)

Node Exporter is the official Prometheus exporter for host (bare-metal) metrics; remember to change the version number below to the latest release.

wget -q https://mirror.nju.edu.cn/github-release/prometheus/node_exporter/LatestRelease/node_exporter-1.9.0.linux-amd64.tar.gz
tar xf node_exporter-*
sudo mv node_exporter-*/node_exporter /usr/local/sbin/
sudo chown root:root /usr/local/sbin/node_exporter
rm -rf ./node_exporter-*
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Restart=always
ExecStart=/usr/local/sbin/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now node_exporter.service
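
To verify, fetch the metrics endpoint; node_exporter listens on port 9100 by default:

curl -s http://localhost:9100/metrics | head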

DCGM-Exporter is NVIDIA's official GPU monitoring exporter, built on top of DCGM.

Install NVIDIA Data Center GPU Manager (DCGM); remember to adjust the OS version, architecture, and keyring package version.

wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
rm -f cuda-keyring_1.1-1_all.deb
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
sudo apt-get update && sudo apt-get install datacenter-gpu-manager-4-cuda-all
sudo systemctl enable --now nvidia-dcgm
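
To confirm DCGM can see the GPUs, list the discovered devices:

dcgmi discovery -l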

Install Go and then build DCGM-Exporter; remember to change the Go version.

wget https://mirror.nju.edu.cn/golang/go1.24.1.linux-amd64.tar.gz
tar -C /usr/local -xzf go*.tar.gz
rm -f go*.tar.gz
export PATH=$PATH:/usr/local/go/bin
go env -w GO111MODULE=on 
go env -w GOPROXY="https://repo.nju.edu.cn/go/,direct"
git clone https://github.com/NVIDIA/dcgm-exporter
cd dcgm-exporter
make binary
cp cmd/dcgm-exporter/dcgm-exporter /usr/local/sbin/
mkdir /usr/local/etc/dcgm-exporter
cp etc/* /usr/local/etc/dcgm-exporter/
cd ..
rm -rf dcgm-exporter
cat > /etc/systemd/system/dcgm-exporter.service <<EOF
[Unit]
Description=Prometheus DCGM exporter
Wants=network-online.target nvidia-dcgm.service
After=network-online.target nvidia-dcgm.service

[Service]
Type=simple
Restart=always
ExecStart=/usr/local/sbin/dcgm-exporter --collectors /usr/local/etc/dcgm-exporter/default-counters.csv

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now dcgm-exporter.service
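
dcgm-exporter listens on port 9400 by default; fetch a few metrics to confirm it is serving data:

curl -s http://localhost:9400/metrics | grep '^DCGM_' | head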

Note that the DCGM and DCGM-Exporter versions must match. DCGM-Exporter version strings have the format <DCGM version>-<exporter version>: below, 4.1.1-4.0.4 means the exporter itself is version 4.0.4 and it expects DCGM 4.1.1, which matches the installed dcgmi version.

/usr/local/sbin/dcgm-exporter --version
2025/03/23 10:46:13 maxprocs: Leaving GOMAXPROCS=112: CPU quota undefined
DCGM Exporter version 4.1.1-4.0.4
dcgmi --version

dcgmi  version: 4.1.1

Monitoring System (Server)

First install Docker and Docker Compose V2, then deploy Grafana and Prometheus as containers. For simplicity, we skip putting a reverse proxy in front, adding authentication to Prometheus, and automatic service discovery/registration with consul.

Create the directories first and change their owner/group, otherwise the containers will fail at startup with permission errors.

mkdir grafana
mkdir prometheus
mkdir prometheus-conf
sudo chown 472:0 grafana/          # 472 is the UID the grafana container runs as
sudo chown 65534:65534 prometheus  # 65534 (nobody) is the user the prometheus container runs as

Edit docker-compose.yml

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    restart: always
    volumes:
      - ./prometheus:/prometheus
      - ./prometheus-conf:/etc/prometheus/conf
    command:
      - --config.file=/etc/prometheus/conf/prometheus.yml
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --web.listen-address=0.0.0.0:9090
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.size=100GB  # storage cap; once exceeded, the oldest data is deleted automatically
      - --storage.tsdb.wal-compression
  grafana:
    image: grafana/grafana
    container_name: grafana
    restart: always
    ports:
      - 3000:3000
    volumes:
      - ./grafana:/var/lib/grafana
    environment:
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
      - GF_SECURITY_ADMIN_PASSWORD=yaoge123   # initial password for admin
      - GF_SERVER_ENABLE_GZIP=true
      - GF_SERVER_DOMAIN=192.168.1.10   # change if behind a reverse proxy
      - GF_SERVER_ROOT_URL=http://192.168.1.10   # change if behind a reverse proxy
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_NAME=yaoge123
      #- GF_SERVER_SERVE_FROM_SUB_PATH=true   # if the reverse proxy adds a sub-path
      #- GF_SERVER_ROOT_URL=%(protocol)s://%(domain)s:%(http_port)s/grafana   # if the reverse proxy adds a sub-path
      #- GF_SECURITY_COOKIE_SECURE=true   # if the reverse proxy terminates HTTPS
    depends_on:
      - prometheus

Fetch the default configuration prometheus.yml

cd prometheus-conf
wget https://raw.githubusercontent.com/prometheus/prometheus/refs/heads/main/documentation/examples/prometheus.yml

The main thing to edit in prometheus.yml is the scrape targets

# my global config
global:
  scrape_interval: 30s # Scrape interval. The default is every 1 minute.
  evaluation_interval: 30s # Evaluate rules every 30 seconds. The default is every 1 minute.
  scrape_timeout: 30s # Scrape timeout; some exporters are slow to read, so the global default of 10s is relaxed here.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
        # The label name is added as a label `label_name=<label_value>` to any timeseries scraped from this config.
        labels:
          app: "prometheus"

  - job_name: node_exporter
    static_configs:
    - targets: 
      - '192.168.1.101:9100'
      - '192.168.1.102:9100'
  - job_name: dcgm-exporter
    static_configs:
    - targets:
      - '192.168.1.101:9400'
      - '192.168.1.102:9400'

Bring the containers up: cd .. && docker compose up -d
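
The compose file does not publish Prometheus's port, so check the scrape targets from inside the container (the prom/prometheus image ships busybox wget); every target reporting up = 1 is being scraped successfully:

docker exec prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=up'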

Next comes configuring Grafana: add Prometheus as a data source (reachable at http://prometheus:9090 from inside the compose network) and import dashboards.

At this point a simple monitoring platform is up and running. In a real production environment, managing monitored targets with consul is indispensable, Prometheus should have authentication added, and Grafana is best placed behind an NGINX reverse proxy providing HTTPS.

DNSmasq: Resolving In-Cluster Hostnames for Containers Outside an HPC Cluster

An HPC cluster usually provides name resolution via DNS and local hosts files so that nodes can talk to each other by hostname rather than raw IP address. But if a container on a standalone server outside the cluster needs to communicate with cluster nodes by hostname, a DNS service must provide that resolution to the container.

Use an automated script to copy the cluster's hosts file to a directory on the standalone server, e.g. /home/hpc/dns/hosts.
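
A minimal sketch of such a sync job (the login-node name login01 and the schedule are placeholders); because dnsmasq is started with --hostsdir below, it notices changes in the directory and reloads automatically:

# /etc/cron.d/hpc-hosts (sketch): refresh the hosts copy every 10 minutes
*/10 * * * * root rsync -az login01:/etc/hosts /home/hpc/dns/hosts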

Build your own dnsmasq container image:

[yaoge123]$ cat dnsmasq/Dockerfile 
FROM alpine:latest
RUN apk update \
 && apk upgrade \
 && apk add --no-cache \
            dnsmasq \
 && rm -rf /var/cache/apk/*

Write the docker-compose.yml:

  1. dnsmasq provides the DNS service and needs a fixed IP address, so that the dns entry of the other containers below can point at it
  2. /home/hpc/dns is the host directory storing the hosts file
  3. Use --keep-in-foreground in production; use --no-daemon and --log-queries when debugging
  4. --domain-needed must be set, to keep dnsmasq from forwarding bare hostnames (names without a dot) to the upstream DNS
  5. Set --cache-size= somewhat higher than the number of lines in the hosts file
  6. abc is a container that needs to resolve in-cluster hostnames; its dns entry is exactly what makes dnsmasq provide its resolution
  7. Containers that don't need this resolution should not get a dns entry

services:
  dnsmasq:
    build: ./dnsmasq
    image: dnsmasq
    container_name: dnsmasq
    networks:
      default:
        ipv4_address: 192.168.100.200
    volumes:
      - /home/hpc/dns:/etc/dns:ro
    command:
      - dnsmasq
      - --keep-in-foreground
      #- --no-daemon
      #- --log-queries
      - --domain-needed
      - --no-hosts
      - --cache-size=3000
      - --hostsdir=/etc/dns
  abc:
    image: abc
    container_name: abc
    dns:
      - 192.168.100.200
  …………
networks:
  default:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 192.168.100.0/24

Test resolution and inspect dnsmasq's cache statistics; ideally evictions stays at 0.

[yaoge123]# docker run --rm -it --network=docker_default --dns=192.168.100.200 alpine sh
/ # apk add bind-tools
/ # dig +short node_name
/ # for i in cachesize.bind insertions.bind evictions.bind misses.bind hits.bind auth.bind servers.bind; do dig +short chaos txt $i; done

Upgrading Firmware with a Singularity Container

Build a CentOS 8 container with SST installed

singularity build --sandbox sst-build docker://centos:8.4.2105

cp sst-*.x86_64.rpm sst-build/home

singularity shell -w sst-build

rpm -ivh /home/sst-*.x86_64.rpm

rm /home/sst-*.x86_64.rpm

exit

singularity build sst.sif sst-build
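
Alternatively, the sandbox steps above can be written as a definition file and built non-interactively (a sketch; the RPM filename sst-1.0.x86_64.rpm is a placeholder):

Bootstrap: docker
From: centos:8.4.2105

%files
    sst-1.0.x86_64.rpm /home/

%post
    rpm -ivh /home/sst-1.0.x86_64.rpm
    rm -f /home/sst-1.0.x86_64.rpm

Then build directly: singularity build sst.sif sst.def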

Upgrade on each node

singularity shell --writable-tmpfs sst.sif

sst show -ssd

sst load -ssd 0

sst load -ssd 1

……

exit

reboot

 

Monitoring NVIDIA GPUs with Prometheus + Grafana

1. First install NVIDIA Data Center GPU Manager (DCGM); download it from https://developer.nvidia.com/dcgm and install

nv-hostengine -t                     # stop any running DCGM host engine
yum erase -y datacenter-gpu-manager  # remove the old version first
rpm -ivh datacenter-gpu-manager*
systemctl enable --now dcgm.service

2. Install the NVIDIA DCGM exporter for Prometheus; download it from https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm and install manually

wget -q -O /usr/local/bin/dcgm-exporter https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter
chmod +x /usr/local/bin/dcgm-exporter
mkdir /run/prometheus 
wget -q -O /etc/systemd/system/prometheus-dcgm.service https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/exporters/prometheus-dcgm/bare-metal/prometheus-dcgm.service
systemctl daemon-reload
systemctl enable --now prometheus-dcgm.service

3. Download node_exporter from https://prometheus.io/download/#node_exporter, install it manually as a service, and add the dcgm-exporter data via the textfile collector

tar xf node_exporter*.tar.gz
mv node_exporter-*/node_exporter /usr/local/bin/
chown root:root /usr/local/bin/node_exporter
chmod +x /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sed -i 's|ExecStart=/usr/local/bin/node_exporter|& --collector.textfile.directory=/run/prometheus|' /etc/systemd/system/node_exporter.service

systemctl daemon-reload
systemctl enable --now node_exporter.service
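
With the textfile collector in place, GPU metrics are now served by node_exporter alongside the host metrics (a quick check; this legacy exporter writes metrics with a dcgm prefix):

curl -s http://localhost:9100/metrics | grep -i '^dcgm' | head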

4. Add this dashboard to Grafana:
https://grafana.com/grafana/dashboards/11752

Automatically Renewing Let's Encrypt Certificates with Docker

Add the webroot and configuration-file mounts to nginx's docker run

-v $PWD/nginx/letsencrypt/:/var/www/letsencrypt:ro \
-v $PWD/letsencrypt/etc/:/etc/nginx/letsencrypt/:ro \

Publish the webroot in nginx

location ^~ /.well-known/ {
    root /var/www/letsencrypt/;
}

Configure the certificate files in nginx

ssl_certificate letsencrypt/live/www.yaoge123.com/fullchain.pem;
ssl_certificate_key letsencrypt/live/www.yaoge123.com/privkey.pem;

Create a docker run script for certbot; from then on, running this script periodically renews the certificate automatically

#!/bin/sh
cd "$(dirname "$0")"
pwd

# no -it: the script must also run non-interactively from cron
docker run --rm \
	-v $PWD/letsencrypt/etc:/etc/letsencrypt \
	-v $PWD/letsencrypt/lib:/var/lib/letsencrypt \
	-v $PWD/letsencrypt/log:/var/log/letsencrypt \
	-v $PWD/nginx/letsencrypt:/var/www \
	certbot/certbot \
	certonly --webroot \
	--email yaoge123@example.com --agree-tos --no-eff-email \
	--webroot-path=/var/www/ \
	-n \
	--domains www.yaoge123.com
docker kill --signal=HUP nginx
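
A cron entry to drive the renewal (a sketch; the script path is a placeholder). certbot only replaces certificates that are close to expiry, so a weekly run is safe:

# /etc/cron.d/certbot-renew (sketch)
30 3 * * 1 root /opt/scripts/renew-cert.sh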

Installing Cacti on CentOS 7 with YUM

Add EPEL first, then install Cacti, Spine, MariaDB, and Chinese fonts with yum
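
If EPEL is not yet enabled, add it first (epel-release ships in the CentOS 7 extras repository):

yum install -y epel-release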

yum install cacti cacti-spine mariadb-server google-noto-sans-simplified-chinese-fonts

Edit /etc/httpd/conf.d/cacti.conf and, in the Directory /usr/share/cacti/ block, add the browser clients allowed access

Edit /etc/cron.d/cacti and uncomment the poller entry

Edit /etc/spine.conf and comment out the RDB_* lines

Create the database

[root@yaoge123]# mysqladmin --user=root create cacti

Create the database user

[root@yaoge123]# mysql --user=root mysql
MariaDB [mysql]> GRANT ALL ON cacti.* TO cactiuser@localhost IDENTIFIED BY 'cactiuser';
MariaDB [mysql]> flush privileges;

Grant the database user the timezone privilege

[root@yaoge123]# mysql -u root
MariaDB [(none)]> GRANT SELECT ON mysql.time_zone_name TO cactiuser@localhost IDENTIFIED BY 'cactiuser';
MariaDB [(none)]> flush privileges;

Load the timezone data into the database

[root@yaoge123]# mysql_tzinfo_to_sql /usr/share/zoneinfo/ | mysql -u root mysql

Create a new file /etc/my.cnf.d/cacti.cnf; the values below are for reference, adjust to your environment

[mysqld]
character-set-client = utf8mb4
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
innodb_additional_mem_pool_size = 80M
innodb_buffer_pool_size = 1024M
innodb_doublewrite = ON
innodb_file_format = Barracuda
innodb_file_per_table = ON
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_large_prefix = ON
join_buffer_size = 748M
max_allowed_packet = 16777216
max_heap_table_size = 374M
tmp_table_size = 374M

Restart the relevant services and enable them at boot

systemctl restart mariadb
systemctl enable mariadb
systemctl restart httpd
systemctl enable httpd

Import the database schema

[root@yaoge123]# mysql cacti < /usr/share/doc/cacti-*/cacti.sql

Open http://<server>/cacti/ in a browser; the default username/password is admin/admin

HPE ProLiant DL380 Gen10 Memory Performance Tests with Different BIOS Settings

Hardware Environment

2*Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz
12*HPE SmartMemory DDR4-2666 RDIMM 16GiB

iLO 5 1.37 Oct 25 2018
System ROM U30 v1.46 (10/02/2018)
Intelligent Platform Abstraction Data 7.2.0 Build 30
System Programmable Logic Device 0x2A
Power Management Controller Firmware 1.0.4
NVMe Backplane Firmware 1.20
Power Supply Firmware 1.00
Power Supply Firmware 1.00
Innovation Engine (IE) Firmware 0.1.6.1
Server Platform Services (SPS) Firmware 4.0.4.288
Redundant System ROM U30 v1.42 (06/20/2018)
Intelligent Provisioning 3.20.154
Power Management Controller FW Bootloader 1.1
HPE Smart Storage Battery 1 Firmware 0.60
HPE Eth 10/25Gb 2p 631FLR-SFP28 Adptr 212.0.103001
HPE Ethernet 1Gb 4-port 331i Adapter - NIC 20.12.41
HPE Smart Array P816i-a SR Gen10 1.65
HPE 100Gb 1p OP101 QSFP28 x16 OPA Adptr 1.5.2.0.0
HPE InfiniBand EDR/Ethernet 100Gb 2-port 840QSF 12.22.40.30
Embedded Video Controller 2.5

Software Environment

CentOS Linux release 7.6.1810 (Core)
Linux yaoge123 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Memory Latency Checker - v3.6


OpenLDAP Upgrade Error: pwdMaxRecordedFailure Does Not Exist

After upgrading to OpenLDAP 2.4.44, the following error appears

User Schema load failed for attribute "pwdMaxRecordedFailure". Error code 17: attribute type undefined
config error processing olcOverlay={1}ppolicy,olcDatabase={2}hdb,cn=config: User Schema load failed for attribute "pwdMaxRecordedFailure". Erro...ype undefined
slapd stopped.

Solution: the schema cached under slapd.d predates the pwdMaxRecordedFailure attribute that the new ppolicy overlay requires, so replace the cached copy with the ppolicy.ldif shipped by the new version

cd /etc/openldap/slapd.d/cn=config/cn=schema
mv cn\=\{3\}ppolicy.ldif cn\=\{3\}ppolicy.ldif.bak
mv /etc/openldap/schema/ppolicy.ldif cn\=\{3\}ppolicy.ldif

 

OpenLDAP Password Policy

OpenLDAP has no password-quality checking by default; even a password like 123456 is accepted, which is clearly not what an administrator wants.

  1. Import the password policy schema
    ldapadd -Y EXTERNAL -H ldapi:/// -D "cn=config" -f /etc/openldap/schema/ppolicy.ldif
  2. Load the module; since the syncprov module was already added, only the ppolicy module needs to be appended
    dn: cn=module{0},cn=config
    changetype: modify
    add: olcModuleLoad
    olcModuleLoad: ppolicy.la
    
    ldapmodify -Y EXTERNAL -H ldapi:/// -f mod_ppolicy.ldif
  3. Specify the DN of the default policy
    dn: olcOverlay=ppolicy,olcDatabase={2}hdb,cn=config
    changeType: add
    objectClass: olcOverlayConfig
    objectClass: olcPPolicyConfig
    olcOverlay: ppolicy
    olcPPolicyDefault: cn=default,ou=ppolicy,dc=yaoge123,dc=com
    olcPPolicyHashCleartext: TRUE
    ldapmodify -Y EXTERNAL -H ldapi:/// -f ppolicy.ldif
  4. Create the default policy
    dn: ou=ppolicy,dc=yaoge123,dc=com
    objectClass: organizationalUnit
    objectClass: top
    ou: ppolicy
    
    dn: cn=default,ou=ppolicy,dc=yaoge123,dc=com
    cn: default
    objectClass: top
    objectClass: device
    objectClass: pwdPolicy
    objectClass: pwdPolicyChecker
    pwdAllowUserChange: TRUE
    pwdAttribute: userPassword
    pwdCheckQuality: 2
    pwdExpireWarning: 604800
    pwdFailureCountInterval: 0
    pwdGraceAuthnLimit: 5
    pwdInHistory: 5
    pwdLockout: TRUE
    pwdLockoutDuration: 600
    pwdMaxAge: 0
    pwdMaxFailure: 5
    pwdMinAge: 0
    pwdMinLength: 8
    pwdMustChange: FALSE
    pwdSafeModify: FALSE
    pwdCheckModule: check_password.so
    ldapadd -Y EXTERNAL -H ldapi:/// -f defaultppolicy.ldif
  5. Edit /etc/openldap/check_password.conf to define the rules for check_password.so (see the sketch after this list)
  6. Both LDAP servers in MirrorMode need the same configuration as above
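
A sketch of /etc/openldap/check_password.conf, assuming the LTB-project check_password module (option names vary between versions, so treat these as illustrative):

# points threshold a password must reach; one point per character class present
min_points 3
# also run candidate passwords through cracklib
use_cracklib 1
# minimum number of characters required from each class (0 = no requirement)
min_upper 1
min_lower 1
min_digit 1
min_punct 0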