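Lenovo DSS-G is Lenovo's integrated hardware and software appliance built on GPFS. Monitoring matters for any storage system: by the time data is lost, it is too late for regrets.
Component database
DSS-G uses a component database (compDB) to describe hardware components such as racks, the DSS-G servers, and JBODs. DSS-G itself runs without a compDB, but health monitoring requires one.
Step 1: Generate the component database
Always use a dry run to generate the component database (compDB). Here dssg01 is the node class that contains the two DSS-G servers (it was created by dssgmkstorage).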
[root@dss01 ~]# dssgmkcompdb --racktype RACK42U --verbose --dryrun -N dssg01
DSS-G 5.0c
Parsing options: --racktype RACK42U --verbose --dryrun -N dssg01 --
Entering verbose mode
Entering dry run mode
Using provided nodeclass or node list
dss01
dss02
Checking whether all nodes belong to the same cluster...
Using rack type "RACK42U" of height 42U
Checking server model...
…………
# DSS-G210-1
Setting componentDB...
Note: all commands below support the --dry-run option
DRYRUN: mmaddcomp -F /root/yaoge123gpfs.io01-comp.2025-03-15.233334.851103.stanza --replace
DRYRUN: mmchcomploc -F /root/yaoge123gpfs.io01-comploc.2025-03-15.233334.851103.stanza
DRYRUN: /root/yaoge123gpfs.io01-dispid.2025-03-15.233334.851103.sh
All done
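Step 2: Review the contents of the two stanza files. The exact position of each component inside the rack is recorded in the -comploc..stanza file.
Step 3: Import the component database. Run the commands that follow "DRYRUN:" in the last three DRYRUN lines of the dssgmkcompdb output exactly as printed, then check the result with mmlscomp and mmlscomploc.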
[root@dss01 ~]# mmlscomp
Rack Components
Comp ID Part Number Serial Number Name
------- ----------- ------------- ----
6 RACK42U ******* A1
Server Components
Comp ID Part Number Serial Number Name Node Number
------- ----------- ------------- ---------------- -----------
9 7D9ECTOLWW ******** SR655V3-******** 101
10 7D9ECTOLWW ******** SR655V3-******** 102
Storage Enclosure Components
Comp ID Part Number Serial Number Name Display ID
------- ----------- ------------- -------------- ----------
8 7DAHCT0LWW ******** D4390-********
Storage Server Components
Comp ID Part Number Serial Number Name
------- ----------- ------------- ----------
7 DSS-G210 DSS-G210-1
[root@dss01 ~]# mmlscomploc
Rack Location Component
---- -------- ----------------
A1 U35-36 SR655V3-********
A1 U33-34 SR655V3-********
A1 U01-04 D4390-********
Storage Server Index Component
-------------- ----- ----------------
DSS-G210-1 3 SR655V3-********
DSS-G210-1 2 SR655V3-********
DSS-G210-1 1 D4390-********
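Copying the three commands by hand is error-prone because the stanza file names carry a timestamp. A minimal sketch, assuming the dry-run output from Step 1 was saved to a file, lists the commands for review. The helper name dryrun_cmds and the file name dryrun.log are hypothetical and not part of DSS-G.

```shell
# dryrun_cmds: print the command part of every "DRYRUN: " line.
# Reads the named files, or standard input when no file is given.
# It only prints the commands; nothing is executed.
dryrun_cmds() {
    sed -n 's/^DRYRUN: //p' "$@"
}

# Example (dryrun.log is a hypothetical file holding the Step 1 output):
#   dryrun_cmds dryrun.log
```

Printing rather than executing keeps the review of the stanza files in Step 2 as a deliberate manual step.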
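Setting up DSS-G monitoring
Because DSS-G is redundant by design (node redundancy, link redundancy, and so on), a fault does not necessarily take down the whole file system, so faults often go unnoticed at first. These faults still erode redundancy or degrade performance, so administrators should know about them and fix them promptly. dssghealthmon periodically (every hour by default) checks the health of the whole DSS-G and automatically sends an email notification when it finds a fault. dssghealthmon can check hardware health out-of-band by reaching the XCC (the server's BMC) through the Confluent node, so the DSS-G servers need passwordless SSH to the Confluent node.
Step 1: Edit the configuration file /etc/dssg/dssghealthmon.conf
- contactEmail: the email address for fault notifications. Required.
- Comment out any other setting you do not want to fill in.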
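Since contactEmail is the only required setting, a quick check before starting the monitor can catch a file where it is still commented out. This is a rough sketch under two assumptions not confirmed here: that the key sits at the start of its line and that comments begin with '#'. The helper name has_contact_email is hypothetical.

```shell
# has_contact_email FILE: succeed if FILE contains a contactEmail line
# that is not commented out.
# Assumption: '#' starts a comment. The value itself is not validated.
has_contact_email() {
    grep -Eq '^[[:space:]]*contactEmail' "$1"
}

# Example:
#   has_contact_email /etc/dssg/dssghealthmon.conf || echo "contactEmail is not set"
```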
Step 2: Look up the DSS-G node class name, then start monitoring
[root@dss01 ~]# mmlsnodeclass
Node Class Name Members
--------------------- -----------------------------------------------------------
dssg01 dss01,dss02
[root@dss01 ~]# dssghealthmon_startup dssg01 confluent
Obtaining Confluent version from management server confluent
3.11.1
Processing nodeclass...
Node Class Name Members
--------------------- -----------------------------------------------------------
dssg01 dss01,dss02
Parsing configuration file...
Copying configuration file...
Creating tuple file...
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********
Copying tuple file...
Setting or replacing the dssghealthmon_erflist cronjob...
Warning: dssghealthmon_erflist cronjob has NOT been specified
Creating and copying daemon environment file...
Starting dssghealthmond...
The dssghealthmon system has been successfully started
Step 3: Check the monitoring status and configuration files on both servers
[root@dss01 ~]# dssghealthmon_status dssg01
Processing nodeclass...
Node Class Name Members
--------------------- -----------------------------------------------------------
dssg01 dss01,dss02
Obtaining status of the DSS-G health monitor...
dss01: active
dss02: active
[root@dss01 ~]# systemctl status dssghealthmond.service
● dssghealthmond.service - DSS-G Health Monitor
Loaded: loaded (/etc/systemd/system/dssghealthmond.service; disabled; preset: disabled)
Active: active (running) since Sat 2025-04-05 18:52:43 CST; 1min 57s ago
Process: 1361859 ExecStart=/opt/lenovo/dss/dssghealthmon/dssghealthmond $NODECLASS $MGMT (code=exited, status=0/SUCCESS)
Main PID: 1361946 (dssghealthmond)
Tasks: 2 (limit: 2468168)
Memory: 2.3M
CPU: 34.410s
CGroup: /system.slice/dssghealthmond.service
├─1361946 /bin/bash /opt/lenovo/dss/dssghealthmon/dssghealthmond dssg01 confluent 1361859
└─1362172 sleep 3600
Apr 05 18:52:43 dss01 systemd[1]: Starting DSS-G Health Monitor...
Apr 05 18:52:43 dss01 systemd[1]: Started DSS-G Health Monitor.
[root@dss01 ~]# cat /etc/dssg/dssghealthmon.env
NODECLASS=dssg01
MGMT=confluent
[root@dss01 ~]# cat /etc/dssg/dssghealthmon.hosts
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********
[root@dss02 ~]# systemctl status dssghealthmond.service
● dssghealthmond.service - DSS-G Health Monitor
Loaded: loaded (/etc/systemd/system/dssghealthmond.service; disabled; preset: disabled)
Active: active (running) since Sat 2025-04-05 18:52:43 CST; 4min 52s ago
Process: 87519 ExecStart=/opt/lenovo/dss/dssghealthmon/dssghealthmond $NODECLASS $MGMT (code=exited, status=0/SUCCESS)
Main PID: 87601 (dssghealthmond)
Tasks: 2 (limit: 2468168)
Memory: 2.2M
CPU: 1.344s
CGroup: /system.slice/dssghealthmond.service
├─87601 /bin/bash /opt/lenovo/dss/dssghealthmon/dssghealthmond dssg01 confluent 87519
└─87823 sleep 3600
Apr 05 18:52:43 dss02 systemd[1]: Starting DSS-G Health Monitor...
Apr 05 18:52:43 dss02 systemd[1]: Started DSS-G Health Monitor.
[root@dss02 ~]# cat /etc/dssg/dssghealthmon.env
NODECLASS=dssg01
MGMT=confluent
[root@dss02 ~]# cat /etc/dssg/dssghealthmon.hosts
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********
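The status check can also be scripted. The sketch below filters the dssghealthmon_status output for nodes that are not active, assuming the status lines keep the "node: state" shape shown above. The helper name inactive_nodes is hypothetical.

```shell
# inactive_nodes: read dssghealthmon_status output on standard input and
# print every node whose reported state is not "active".
# Assumption: status lines look like "dss01: active"; all other lines
# (headers, node class table) contain no ": " separator and are skipped.
inactive_nodes() {
    awk -F': *' 'NF == 2 && $1 !~ /[[:space:]]/ && $2 != "active" { print $1 }'
}

# Example:
#   dssghealthmon_status dssg01 | inactive_nodes
```

An empty result means every node reported active.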
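Step 4: Configure and test outgoing mail on both servers. This example uses Tencent Exmail (QQ enterprise mail). Note that the credentials in the mta URL must be URL-encoded (for example, @ becomes %40).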
[root@dss01 ~]# cat ~/.mailrc
set v15-compat
set mta=smtps://yaoge%40yaoge123.com:************@smtp.exmail.qq.com:465 smtp-auth=login
set from=cicam@nju.edu.cn
[root@dss01 ~]# echo "$HOSTNAME" | mailx -s "TEST" yaoge@yaoge123.com
[root@dss01 ~]# scp ~/.mailrc dss02:/root/
.mailrc
[root@dss02 ~]# echo "$HOSTNAME" | mailx -s "TEST" yaoge@yaoge123.com
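The user name and password inside the mta URL have to be URL-encoded, which is why the @ in the account name appears as %40 in .mailrc. A minimal sketch of an encoder in plain shell follows. The function name urlencode is chosen for this example, and it only handles single-byte (ASCII) input correctly.

```shell
# urlencode STRING: percent-encode every character except the unreserved
# set A-Z a-z 0-9 . _ ~ -
# Assumption: input is ASCII; multi-byte characters are not handled.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Example:
#   urlencode 'yaoge@yaoge123.com'
```

Encode the user name and the password separately, then join them with ':' in the mta URL.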
Step 5: Simulate a fault
[root@dss01 ~]# mmvdisk rg list
[root@dss01 ~]# mmvdisk pdisk list --rg dss02
# Pick a pdisk and simulate its failure
[root@dss01 ~]# mmvdisk pdisk change --rg dss02 --pdisk e1s44 --simulate-dead
[root@dss01 ~]# mmvdisk pdisk list --rg dss02 --not-ok
declustered
recovery group pdisk array paths capacity free space FRU (type) state
-------------- ------------ ----------- ----- -------- ---------- --------------- -----
dss02 e1s44 DA1 0 20 TiB 1024 GiB 03LC215 simulatedDead/draining/replace
# The rebuild has already started
[root@dss01 ~]# mmvdisk rg list --rg dss02 --all
needs user
recovery group node class active current or master server service vdisks remarks
-------------- ---------- ------- -------------------------------- ------- ------ -------
dss02 dssg01 yes dss02 yes 2
……
declustered needs vdisks pdisks capacity
array service type BER trim user log total spare rt total raw free raw background task
----------- ------- ---- ------- ---- ---- --- ----- ----- -- --------- -------- ---------------
NVR no NVR enable - 0 1 2 0 1 - - scrub 14d (66%)
SSD no SSD enable - 0 1 1 0 1 - - scrub 14d (20%)
DA1 yes HDD enable no 2 1 44 2 2 834 TiB 144 GiB rebuild-1r (8%)
……
vdisk RAID code disk group fault tolerance remarks
------------------ --------------- --------------------------------- -------
RG002LOGHOME 4WayReplication - rebuilding
RG002LOGTIP 2WayReplication 1 pdisk
RG002LOGTIPBACKUP Unreplicated 0 pdisk
RG002VS001 3WayReplication - rebuilding
RG002VS002 8+2p - rebuilding
Step 6: After the alert email arrives, revive the pdisk to restore normal operation
[root@dss01 ~]# mmvdisk pdisk change --rg dss02 --pdisk e1s44 --revive
[root@dss01 ~]# mmvdisk pdisk list --rg dss02 --not-ok
mmvdisk: All pdisks of recovery group 'dss02' are ok.
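When repeating the test in Steps 5 and 6, it helps to reduce the --not-ok listing to just the pdisk name and state. The sketch below does this with awk, assuming the column layout shown above (pdisk name in the second column, state in the last, no spaces inside either). The helper name not_ok_pdisks is hypothetical.

```shell
# not_ok_pdisks: read "mmvdisk pdisk list ... --not-ok" output on standard
# input and print "<pdisk> <state>" for each listed pdisk.
# Rows are only read after the dashed separator line, so the
# "All pdisks ... are ok." message produces no output.
not_ok_pdisks() {
    awk '$1 ~ /^-+$/ { seen = 1; next } seen && NF { print $2, $NF }'
}

# Example:
#   mmvdisk pdisk list --rg dss02 --not-ok | not_ok_pdisks
```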
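Step 7: Check the logs
- /var/log/dssg/dssghealthmond.log.<timestamp> is the log of the periodic status checks. The timestamp is the moment the health check service started, and each periodic check appends its output to the end of the file. Comparing the logs on the two servers shows that only one server actually performs the health check; the other sees that it is not the leader and does nothing.
- /var/log/dssg/dssghealthmon.erf_* are the fault logs. The file name contains the fault type, the component identifier, and a timestamp, and the body of the alert email is the content of this file.
- Once a fault has been resolved, delete the ERF files on all DSS nodes.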
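The ERF cleanup can be sketched as a small helper that removes only files matching the dssghealthmon.erf_* pattern and leaves the periodic logs alone. The function name clean_erf is hypothetical, and the default directory is the one named above.

```shell
# clean_erf [DIR]: delete ERF fault files in DIR (default /var/log/dssg),
# printing each file name before it is removed.
clean_erf() {
    dir=${1:-/var/log/dssg}
    for f in "$dir"/dssghealthmon.erf_*; do
        [ -e "$f" ] || continue   # pattern matched nothing
        printf 'removing %s\n' "$f"
        rm -f -- "$f"
    done
}

# Run it on every node, for example on dss01 and then again on dss02:
#   clean_erf
```

Confirm that the fault is really resolved before deleting, since the ERF file is the only copy of the alert body on that node.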