Lenovo DSS-G Monitoring

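Lenovo DSS-G is Lenovo's integrated hardware and software appliance built on GPFS. For a storage system, monitoring is essential: once data has been lost, it is too late for regret.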

Component Database

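DSS-G uses a component database (compDB) to describe hardware components such as racks, the DSS-G servers, and JBODs. DSS-G itself runs without compDB, but health monitoring requires it.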

Step 1: Generate the component database

Always generate the component database (compDB) with a dry run first. Here dssg01 is the node class containing the two DSS-G servers (created when dssgmkstorage was run).

[root@dss01 ~]# dssgmkcompdb --racktype RACK42U --verbose --dryrun -N dssg01
DSS-G 5.0c
Parsing options: --racktype RACK42U --verbose --dryrun -N dssg01 --
Entering verbose mode
Entering dry run mode

Using provided nodeclass or node list
dss01
dss02
Checking whether all nodes belong to the same cluster...

Using rack type "RACK42U" of height 42U

Checking server model...

…………

# DSS-G210-1

Setting componentDB...
Note: all commands below support the --dry-run option
DRYRUN: mmaddcomp -F /root/yaoge123gpfs.io01-comp.2025-03-15.233334.851103.stanza --replace
DRYRUN: mmchcomploc -F /root/yaoge123gpfs.io01-comploc.2025-03-15.233334.851103.stanza
DRYRUN: /root/yaoge123gpfs.io01-dispid.2025-03-15.233334.851103.sh

All done

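Step 2: Review the contents of the two stanza files. The exact position of each component in the rack is recorded in the -comploc stanza file (the one passed to mmchcomploc).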

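Step 3: Import the component database. Run the three commands printed after DRYRUN: at the end of the dssgmkcompdb output exactly as shown, then check the result with mmlscomp and mmlscomploc.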

[root@dss01 ~]# mmlscomp

    Rack Components

Comp ID  Part Number  Serial Number  Name
-------  -----------  -------------  ----
      6  RACK42U      *******        A1

    Server Components

Comp ID  Part Number  Serial Number  Name              Node Number
-------  -----------  -------------  ----------------  -----------
      9  7D9ECTOLWW   ********       SR655V3-********          101
     10  7D9ECTOLWW   ********       SR655V3-********          102

    Storage Enclosure Components

Comp ID  Part Number  Serial Number  Name            Display ID
-------  -----------  -------------  --------------  ----------
      8  7DAHCT0LWW   ********       D4390-********

    Storage Server Components

Comp ID  Part Number  Serial Number  Name
-------  -----------  -------------  ----------
      7  DSS-G210                    DSS-G210-1
[root@dss01 ~]# mmlscomploc

Rack  Location  Component
----  --------  ----------------
A1    U35-36    SR655V3-********
A1    U33-34    SR655V3-********
A1    U01-04    D4390-********

Storage Server  Index  Component
--------------  -----  ----------------
DSS-G210-1          3  SR655V3-********
DSS-G210-1          2  SR655V3-********
DSS-G210-1          1  D4390-********
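
Copying the three DRYRUN commands by hand is error-prone because the stanza file names carry long timestamps. As a minimal sketch, assuming the dry-run output was saved to a file and that the commands always follow the literal prefix "DRYRUN: " as in the transcript in Step 1, the helper below prints just the commands for review. The name extract_dryrun_cmds and the log path in the usage comment are hypothetical, not part of DSS-G.

```shell
# Print the commands that follow "DRYRUN: " in saved dssgmkcompdb output.
# Reads the file(s) given as arguments, or standard input when none is given.
extract_dryrun_cmds() {
    sed -n 's/^DRYRUN: //p' "$@"
}

# Hypothetical usage (the log path is an assumption):
#   dssgmkcompdb --racktype RACK42U --verbose --dryrun -N dssg01 | tee /root/compdb-dryrun.log
#   extract_dryrun_cmds /root/compdb-dryrun.log
```

The helper only prints the commands; running them remains a deliberate manual step after the stanza files have been reviewed.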

Setting Up DSS-G Monitoring

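Because DSS-G is redundant by design (redundant nodes, redundant links, and so on), a fault does not necessarily take down the whole file system, so faults often go unnoticed at first. They still erode redundancy or degrade performance, so administrators need to know about them and repair them promptly. dssghealthmon periodically (every hour by default) checks the health of the entire DSS-G and sends an email notification automatically when it finds a fault. It can check hardware health out of band through the Confluent node, which accesses the XCC (the server's BMC), so passwordless SSH from the DSS-G servers to the Confluent node is required.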
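
The passwordless SSH prerequisite can be confirmed before starting the monitor by attempting a non-interactive login. This is a minimal sketch, assuming the Confluent node is reachable under the hostname confluent as in the commands in this post. check_passwordless_ssh is a hypothetical helper name and is not part of DSS-G.

```shell
# Succeed only if a key-based login to the given host works without a prompt.
# BatchMode=yes makes ssh fail instead of asking for a password;
# ConnectTimeout=5 keeps the check from hanging on an unreachable host.
check_passwordless_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Hypothetical usage, to be run on each DSS-G server:
#   check_passwordless_ssh confluent && echo "SSH to confluent OK"
```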

Step 1: Edit the configuration file /etc/dssg/dssghealthmon.conf

  • contactEmail: the email address for fault notifications; required.
  • Comment out any other settings you do not want to fill in.
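
Since contactEmail is the one required setting, a quick pre-flight check can catch a missing or commented-out entry before the monitor is started. The file's exact syntax is not shown in this post, so the sketch below is an assumption: it accepts either "contactEmail=value" or "contactEmail: value" and treats lines starting with # as comments. has_contact_email is a hypothetical helper name.

```shell
# Succeed if the file has an uncommented contactEmail entry with a non-empty value.
# Assumption: key and value are separated by "=" or ":" and "#" starts a comment.
has_contact_email() {
    grep -Eq '^[[:space:]]*contactEmail[[:space:]]*[:=][[:space:]]*[^[:space:]#]' "$1"
}

# Hypothetical usage:
#   has_contact_email /etc/dssg/dssghealthmon.conf || echo "contactEmail is not set"
```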

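Step 2: Look up the name of the DSS-G node class, then start monitoring (confluent is the hostname of the Confluent management server)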

[root@dss01 ~]# mmlsnodeclass 
Node Class Name       Members
--------------------- -----------------------------------------------------------
dssg01                dss01,dss02

[root@dss01 ~]# dssghealthmon_startup dssg01 confluent
Obtaining Confluent version from management server confluent
3.11.1

Processing nodeclass...
Node Class Name       Members
--------------------- -----------------------------------------------------------
dssg01                dss01,dss02

Parsing configuration file...

Copying configuration file...

Creating tuple file...
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********

Copying tuple file...

Setting or replacing the dssghealthmon_erflist cronjob...
Warning: dssghealthmon_erflist cronjob has NOT been specified

Creating and copying daemon environment file...

Starting dssghealthmond...
The dssghealthmon system has been successfully started

Step 3: On both servers, check the monitoring status and the configuration files

[root@dss01 ~]# dssghealthmon_status dssg01
Processing nodeclass...
Node Class Name       Members
--------------------- -----------------------------------------------------------
dssg01                dss01,dss02

Obtaining status of the DSS-G health monitor...
dss01: active
dss02: active
[root@dss01 ~]# systemctl status dssghealthmond.service
● dssghealthmond.service - DSS-G Health Monitor
     Loaded: loaded (/etc/systemd/system/dssghealthmond.service; disabled; preset: disabled)
     Active: active (running) since Sat 2025-04-05 18:52:43 CST; 1min 57s ago
    Process: 1361859 ExecStart=/opt/lenovo/dss/dssghealthmon/dssghealthmond $NODECLASS $MGMT (code=exited, status=0/SUCCESS)
   Main PID: 1361946 (dssghealthmond)
      Tasks: 2 (limit: 2468168)
     Memory: 2.3M
        CPU: 34.410s
     CGroup: /system.slice/dssghealthmond.service
             ├─1361946 /bin/bash /opt/lenovo/dss/dssghealthmon/dssghealthmond dssg01 confluent 1361859
             └─1362172 sleep 3600

Apr 05 18:52:43 dss01 systemd[1]: Starting DSS-G Health Monitor...
Apr 05 18:52:43 dss01 systemd[1]: Started DSS-G Health Monitor.
[root@dss01 ~]# cat /etc/dssg/dssghealthmon.env 
NODECLASS=dssg01
MGMT=confluent
[root@dss01 ~]# cat /etc/dssg/dssghealthmon.hosts 
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********

[root@dss02 ~]# systemctl status dssghealthmond.service
● dssghealthmond.service - DSS-G Health Monitor
     Loaded: loaded (/etc/systemd/system/dssghealthmond.service; disabled; preset: disabled)
     Active: active (running) since Sat 2025-04-05 18:52:43 CST; 4min 52s ago
    Process: 87519 ExecStart=/opt/lenovo/dss/dssghealthmon/dssghealthmond $NODECLASS $MGMT (code=exited, status=0/SUCCESS)
   Main PID: 87601 (dssghealthmond)
      Tasks: 2 (limit: 2468168)
     Memory: 2.2M
        CPU: 1.344s
     CGroup: /system.slice/dssghealthmond.service
             ├─87601 /bin/bash /opt/lenovo/dss/dssghealthmon/dssghealthmond dssg01 confluent 87519
             └─87823 sleep 3600

Apr 05 18:52:43 dss02 systemd[1]: Starting DSS-G Health Monitor...
Apr 05 18:52:43 dss02 systemd[1]: Started DSS-G Health Monitor.
[root@dss02 ~]# cat /etc/dssg/dssghealthmon.env 
NODECLASS=dssg01
MGMT=confluent
[root@dss02 ~]# cat /etc/dssg/dssghealthmon.hosts 
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********
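
The status check can be scripted so that it reports only servers whose monitor is not running. This is a minimal sketch, assuming dssghealthmon_status keeps printing one "host: state" line per server as in the transcript above. Only the state "active" appears in this post, so any other state name is an assumption, and inactive_monitors is a hypothetical helper name.

```shell
# Read dssghealthmon_status output on stdin and print every "host: state" line
# whose state is not "active". The exit status is 0 only if no such line is found.
inactive_monitors() {
    awk -F': *' '
        /^[^ :]+: *[^ ]+$/ && $2 != "active" { print; bad = 1 }
        END { exit bad }
    '
}

# Hypothetical usage:
#   dssghealthmon_status dssg01 | inactive_monitors || echo "a monitor is not active"
```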

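Step 4: Configure and test outgoing email on both servers. This example uses Tencent Exmail (QQ enterprise mail); note that the user name and password in the mta URL must be URL-encoded.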

[root@dss01 ~]# cat ~/.mailrc 
set v15-compat
set mta=smtps://yaoge%40yaoge123.com:************@smtp.exmail.qq.com:465 smtp-auth=login
set from=cicam@nju.edu.cn
[root@dss01 ~]# echo "$HOSTNAME" | mailx -s "TEST" yaoge@yaoge123.com

[root@dss01 ~]# scp ~/.mailrc dss02:/root/
.mailrc

[root@dss02 ~]# echo "$HOSTNAME" | mailx -s "TEST" yaoge@yaoge123.com
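
The URL-encoding required for the mta line can be done with a small helper instead of by hand. This is a sketch using only od, tr, and printf. urlencode is a hypothetical helper name, and the set of characters left unencoded (letters, digits, and - . _ ~) is a common convention assumed here rather than something this post specifies.

```shell
# Percent-encode every byte of $1 except letters, digits, and - . _ ~
urlencode() {
    # od prints each input byte as two hex digits; tr puts one byte per line.
    printf '%s' "$1" | od -An -v -tx1 | tr -s ' ' '\n' | while read -r hex; do
        [ -n "$hex" ] || continue
        case $hex in
            # 0-9, A-Z, a-z, "-", ".", "_", "~" are emitted unchanged
            3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a]|2d|2e|5f|7e)
                printf "\\$(printf '%03o' "0x$hex")" ;;
            # everything else becomes %XX with uppercase hex digits
            *)  printf '%%%s' "$(printf '%s' "$hex" | tr a-f A-F)" ;;
        esac
    done
    echo
}

# Usage: encode the user name and the password separately, then join them with ":".
#   urlencode 'yaoge@yaoge123.com'    # prints yaoge%40yaoge123.com
```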

Step 5: Test with a simulated fault

[root@dss01 ~]# mmvdisk rg list
[root@dss01 ~]# mmvdisk pdisk list --rg dss02
# Pick a pdisk and simulate its failure
[root@dss01 ~]# mmvdisk pdisk change --rg dss02 --pdisk e1s44 --simulate-dead
[root@dss01 ~]# mmvdisk pdisk list --rg dss02 --not-ok

                              declustered                                                
recovery group  pdisk            array     paths  capacity  free space  FRU (type)       state
--------------  ------------  -----------  -----  --------  ----------  ---------------  -----
dss02           e1s44         DA1              0    20 TiB    1024 GiB  03LC215          simulatedDead/draining/replace

# The rebuild has already started
[root@dss01 ~]# mmvdisk rg list --rg dss02 --all

                                                                        needs    user 
recovery group  node class  active   current or master server          service  vdisks  remarks
--------------  ----------  -------  --------------------------------  -------  ------  -------
dss02           dssg01      yes      dss02                             yes           2  

……

declustered   needs                         vdisks       pdisks           capacity     
   array     service  type    BER    trim  user log  total spare rt  total raw free raw  background task
-----------  -------  ----  -------  ----  ---- ---  ----- ----- --  --------- --------  ---------------
NVR          no       NVR   enable   -        0   1      2     0  1          -        -  scrub 14d (66%)
SSD          no       SSD   enable   -        0   1      1     0  1          -        -  scrub 14d (20%)
DA1          yes      HDD   enable   no       2   1     44     2  2    834 TiB  144 GiB  rebuild-1r (8%)

……

vdisk               RAID code        disk group fault tolerance         remarks
------------------  ---------------  ---------------------------------  -------
RG002LOGHOME        4WayReplication  -                                  rebuilding
RG002LOGTIP         2WayReplication  1 pdisk                            
RG002LOGTIPBACKUP   Unreplicated     0 pdisk                            
RG002VS001          3WayReplication  -                                  rebuilding
RG002VS002          8+2p             -                                  rebuilding
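
For scripting around this test, the --not-ok listing can be reduced to one short line per affected pdisk. This is a minimal sketch, assuming the table layout shown above: a dashed separator line followed by data rows, with the state in the last column. not_ok_pdisks is a hypothetical helper name.

```shell
# Read "mmvdisk pdisk list --rg <rg> --not-ok" output on stdin and print
# "<recovery group> <pdisk> <state>" for each data row below the dashed line.
# Prints nothing when there is no table, for example when all pdisks are ok.
not_ok_pdisks() {
    awk '
        /^-+[ -]*$/ { in_table = 1; next }      # separator: data rows follow
        in_table && NF >= 3 { print $1, $2, $NF }
    '
}

# Hypothetical usage:
#   mmvdisk pdisk list --rg dss02 --not-ok | not_ok_pdisks
```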

Step 6: After the alert email arrives, restore the pdisk to normal

[root@dss01 ~]# mmvdisk pdisk change --rg dss02 --pdisk e1s44 --revive
[root@dss01 ~]# mmvdisk pdisk list --rg dss02 --not-ok
mmvdisk: All pdisks of recovery group 'dss02' are ok.

Step 7: Check the logs

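  • /var/log/dssg/dssghealthmond.log.<timestamp> is the log of the periodic status checks. The timestamp is the moment the health check service started, and each periodic check appends to the end of the file. Comparing the logs on the two servers shows that only one server actually performs the health checks; the other sees that it is not the leader and does nothing.
  • /var/log/dssg/dssghealthmon.erf_* are the fault logs. The file name includes the fault type, the component identifier, and a timestamp, and the body of the alert email is the content of this file.
  • Once a fault has been resolved, delete the ERF files on all DSS-G nodes.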
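
Before deleting ERF files it helps to list what is actually there. This is a minimal sketch, assuming the file name pattern dssghealthmon.erf_* under /var/log/dssg described above. list_erf_files is a hypothetical helper name, and the cleanup loop in the comment assumes root SSH between the two servers works, as the scp in Step 4 suggests.

```shell
# List ERF files in a log directory (default: /var/log/dssg), one path per line.
# Prints nothing when no ERF file exists.
list_erf_files() {
    local dir f
    dir=${1:-/var/log/dssg}
    for f in "$dir"/dssghealthmon.erf_*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Hypothetical cleanup after a fault has been fixed (review the list first):
#   for node in dss01 dss02; do
#       ssh "$node" 'rm -f /var/log/dssg/dssghealthmon.erf_*'
#   done
```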
