Dell 12th-generation server reports "CPU 1 has an internal error (IERR)"

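[Summary] A 12th-generation Dell PowerEdge R420 server suddenly hung and stopped responding. iDRAC was still reachable, but a reset issued through iDRAC had no effect. I recall an identical machine hanging once before; since no useful system logs were captured that time, I did not pay much attention to it.
This time the log did contain an error: "CPU 1 has an internal error (IERR)". The system is configured for high availability with keepalived, so losing one node does not affect the service and there was no rush. That made it a good opportunity to track down the root cause.
While consulting Google, I also called Dell Gold Support at 400-886-8618. After hearing my description, the support engineer made the following suggestions: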

(1) In the BIOS, go to System Profile Settings -> System Profile and change it to Performance
(2) Upgrade the BIOS (BIOS download link)

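Google results also say that 12th-generation Dell servers have power management problems, and recommend using the acpi-cpufreq power management module: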

# modprobe -r p4-clockmod
# modprobe acpi-cpufreq
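
Before swapping modules it helps to see which cpufreq driver module is actually loaded. The following is a minimal sketch, assuming the usual /proc/modules layout where the first field is the module name (with underscores rather than hyphens). The function name `loaded_cpufreq_driver` is a hypothetical helper, not a standard tool, and the list of module names is taken from the driver directory shown later in this post.

```shell
# Sketch: report which cpufreq driver module is loaded.
# Reads /proc/modules-style (or lsmod-style) text on stdin.
# loaded_cpufreq_driver is a hypothetical helper name.
loaded_cpufreq_driver() {
    awk '
        $1 == "acpi_cpufreq" || $1 == "p4_clockmod" ||
        $1 == "pcc_cpufreq"  || $1 == "powernow_k8" {
            print $1
            found = 1
        }
        END { if (!found) print "none" }
    '
}

# Usage on a live system:
#   loaded_cpufreq_driver < /proc/modules
```

If this prints p4_clockmod, the two modprobe commands above are the relevant ones; if it prints none, no cpufreq driver module is currently loaded.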

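Since iDRAC could not restart the machine, I asked the data center's remote hands to power-cycle it. It actually came back up, so the power supply and motherboard appear to be fine. From here on things were easy, because everything else could be done through iDRAC.
Taking it step by step, I first changed System Profile to Performance in the BIOS,
then upgraded the BIOS from version 1.5.2 to 2.1.2.
The process was as follows: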

# ./BIOS_R5R32_LN_2.1.2.BIN 
Collecting inventory...
....
Running validation...

BIOS

The version of this Update Package is newer than the currently installed version.
Software application name: BIOS
Package version: 2.1.2
Installed version: 1.5.2


Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.............................................................................
The BIOS image file is successfully loaded. To successfully apply the BIOS update, do not shut down, cold reboot, power cycle, or turn off the system before the
BIOS update is complete. Reboot the system for the update to take effect. Note:  If OMSA is installed on the system, the OMSA data manager service stops if it
is already running.
Would you like to reboot your system now?
Continue? Y/N:Y

Broadcast message from root@sudops.com
	(/dev/pts/0) at 23:16 ...
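
The update package does its own version check, but the same comparison can be scripted when auditing several machines. This is a rough sketch, assuming GNU `sort -V` is available; `version_lt` is a hypothetical helper name, and the target version 2.1.2 is simply the one used in this post.

```shell
# Sketch: decide whether a BIOS update is needed by comparing version strings.
# version_lt is a hypothetical helper; it relies on GNU sort -V.
version_lt() {
    # Exit 0 when $1 sorts strictly before $2 in version order.
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage on a live system (dmidecode needs root):
#   installed=$(dmidecode -s bios-version)
#   version_lt "$installed" "2.1.2" && echo "BIOS update available"
```

Version-aware sorting matters here because a plain string comparison would rank 1.10.0 before 1.5.2.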

After the reboot I logged in over ssh and found many entries like this in dmesg:

p4-clockmod: Warning: EST-capable CPU detected. The acpi-cpufreq module offers voltage scaling in addition of frequency scaling. You should use that instead of p4-clockmod, if possible.
p4-clockmod: Warning: EST-capable CPU detected. The acpi-cpufreq module offers voltage scaling in addition of frequency scaling. You should use that instead of p4-clockmod, if possible.
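
To check several hosts for the same symptom, the warning can be counted rather than eyeballed. A minimal sketch, assuming the message text matches the lines above exactly; `count_p4_clockmod_warnings` is a hypothetical helper name.

```shell
# Sketch: count p4-clockmod EST warnings in kernel log text read from stdin.
# count_p4_clockmod_warnings is a hypothetical helper name.
count_p4_clockmod_warnings() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is normalized with "|| true".
    grep -c 'p4-clockmod: Warning: EST-capable CPU detected' || true
}

# Usage on a live system:
#   dmesg | count_p4_clockmod_warnings
```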

So the fix found through Google does seem necessary, and I ran the two commands:

# modprobe -r p4-clockmod
# modprobe acpi-cpufreq
FATAL: Error inserting acpi_cpufreq (/lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko): No such device
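It failed with "No such device". I first read that as the module file being missing, but the file is right there, so how could it not be found?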

# ls -l /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/*
-rwxr--r--. 1 root root 23672 Nov  9  2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko
-rwxr--r--. 1 root root  5824 Nov  9  2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/mperf.ko
-rwxr--r--. 1 root root 12160 Nov  9  2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/p4-clockmod.ko
-rwxr--r--. 1 root root 18552 Nov  9  2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/pcc-cpufreq.ko
-rwxr--r--. 1 root root 41704 Nov  9  2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/powernow-k8.ko
-rwxr--r--. 1 root root 13120 Nov  9  2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/speedstep-lib.ko

# modprobe -l acpi-cpufreq
kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko
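
The error text is worth reading closely. "No such device" is the message for ENODEV, which on module load typically means the driver initialized and found nothing it could manage, whereas a missing module file is reported with different wording. A minimal sketch of that distinction follows; `explain_modprobe_error` and the labels it prints are hypothetical names for illustration only.

```shell
# Sketch: classify a modprobe failure message.
# explain_modprobe_error and its output labels are hypothetical.
explain_modprobe_error() {
    case "$1" in
        *"No such device"*)
            # ENODEV: the module file was read, but the driver found no device
            echo "driver-found-no-device" ;;
        *"not found"*|*"No such file or directory"*)
            # the module file itself could not be located
            echo "module-file-missing" ;;
        *)
            echo "unknown" ;;
    esac
}

# Usage:
#   explain_modprobe_error "$(modprobe acpi-cpufreq 2>&1)"
```

Under that reading, the `ls` and `modprobe -l` output above is consistent with the error: the file exists, and the driver simply has nothing to attach to in the current BIOS profile.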

Back to Google...
I found an article on jaseywang.me explaining that no cpufreq module can be loaded under the Performance profile:

1. Performance Per Watt(DAPC): System DBPM(DAPC)
No module can be loaded in this mode:
# cpuspeed
Error: Could not find any CPUFreq controlled CPU cores to manage
# /etc/init.d/cpuspeed status
cpuspeed is stopped

2. Performance Per Watt(OS): OS DBPM
After boot, the system has loaded acpi_cpufreq automatically:
# lsmod | grep cpu
cpufreq_ondemand       10544  24
acpi_cpufreq            7891  1
freq_table              4881  2 cpufreq_ondemand,acpi_cpufreq
mperf                   1557  1 acpi_cpufreq

# /etc/init.d/cpuspeed status
Frequency scaling enabled using ondemand governor

3. Performance: Maximum Performance
No module can be loaded in this mode either
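
The three profiles above can be summarized as a small lookup. This sketch encodes only the behavior reported in the quoted article for these BIOS profiles; other models or BIOS versions may behave differently, and `os_cpufreq_available` is a hypothetical helper name.

```shell
# Sketch: can the OS load a cpufreq driver under a given BIOS System Profile?
# Encodes only the observations quoted above; os_cpufreq_available is hypothetical.
os_cpufreq_available() {
    case "$1" in
        "Performance Per Watt(OS)")
            echo "yes" ;;   # OS DBPM: acpi_cpufreq loads automatically
        "Performance Per Watt(DAPC)"|"Performance")
            echo "no" ;;    # power management stays with the platform
        *)
            echo "unknown" ;;
    esac
}

# Usage:
#   os_cpufreq_available "Performance Per Watt(OS)"
```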

So I went back into the BIOS and changed System Profile to Performance Per Watt(OS): OS DBPM.

After another reboot, dmesg looked normal. The problem appears to be solved, though only time will tell!
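
Besides reading dmesg, the result can be confirmed from sysfs. The sketch below assumes the standard cpufreq sysfs layout under /sys/devices/system/cpu; `cpufreq_summary` is a hypothetical helper name, and the optional directory argument exists only so the function can be pointed at another tree.

```shell
# Sketch: print "cpuN <driver> <governor>" for every CPU that exposes cpufreq.
# cpufreq_summary is a hypothetical helper; $1 optionally overrides the sysfs root.
cpufreq_summary() {
    base=${1:-/sys/devices/system/cpu}
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        # Skip CPUs without a cpufreq directory (or an unmatched glob).
        [ -r "$dir/scaling_driver" ] || continue
        cpu=${dir%/cpufreq}
        cpu=${cpu##*/}
        printf '%s %s %s\n' "$cpu" \
            "$(cat "$dir/scaling_driver")" \
            "$(cat "$dir/scaling_governor")"
    done
}

# Usage on a live system:
#   cpufreq_summary
```

Empty output means no CPU is under OS frequency control, which is what the DAPC and Performance profiles would be expected to produce.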

While troubleshooting, I found the cpufreq_setup usage notes quite valuable:
https://access.redhat.com/site/documentation/zh-CN/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/cpufreq_setup.html

Also, the iDRAC command line really does offer a lot of options.
For example, the SEL log retrieved through iDRAC looks like this:

racadm>>getsel 
racadm getsel  
-------------------------------------------------------------------------------
Record:      2
Date/Time:   05/22/2014 12:44:33
Source:      system
Severity:    Critical
Description: CPU 1 has an internal error (IERR).
-------------------------------------------------------------------------------
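
Since the SEL is where the IERR finally showed up, it can be useful to filter it for critical entries. A minimal sketch, assuming the `getsel` output keeps the field order shown above (Record, Date/Time, Source, Severity, Description); `critical_sel_entries` is a hypothetical helper name.

```shell
# Sketch: print "record<TAB>date/time<TAB>description" for Critical SEL records.
# Reads racadm getsel output on stdin; critical_sel_entries is a hypothetical helper.
critical_sel_entries() {
    awk '
        /^Record:/      { record = $2 }
        /^Date\/Time:/  { sub(/^Date\/Time:[ \t]*/, ""); when = $0 }
        /^Severity:/    { severity = $2 }
        /^Description:/ {
            sub(/^Description:[ \t]*/, "")
            if (severity == "Critical") print record "\t" when "\t" $0
        }
    '
}

# Usage, with the getsel output saved to a file first:
#   critical_sel_entries < getsel.txt
```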

Other help output:

/admin1-> help
[Usage]
    show   [<options>] [<target>] [<properties>] 
           [<propertyname>== <propertyvalue>]
    set    [<options>] [<target>] <propertyname>=<value>
    cd     [<options>] [<target>]
    create [<options>] <target> [<property of new target>=<value>] 
           [<property of new target>=<value>]
    delete [<options>] <target>
    exit   [<options>]
    reset  [<options>] [<target>]
    start  [<options>] [<target>]
    stop   [<options>] [<target>]
    version [<options>]
    help   [<options>] [<help topics>]
    load -source <URI> [<options>] [<target>]
    dump -destination <URI> [<options>] [<target>]

/admin1-> racadm
racadm>>help 

racadm help  
 
 help [subcommand]    -- display usage summary for a subcommand
 arp                  -- display the networking ARP table
 clearasrscreen       -- clear the last ASR (crash) screen
 closessn             -- close a session
 clrraclog            -- clear the RAC log
 clrsel               -- clear the System Event Log (SEL)
 config               -- Deprecated: modify RAC configuration properties
 coredump             -- display the last RAC coredump
 coredumpdelete       -- delete the last RAC coredump
 eventfilters         -- Alerts configuration commands
 fwupdate             -- update the RAC firmware
 get                  -- display RAC configuration properties
 getconfig            -- Deprecated: display RAC configuration properties
 getled               -- Get the state of the LED on a module.
 getniccfg            -- display current network settings
 getraclog            -- display the RAC log
 getractime           -- display the current RAC time
 getsel               -- display records from the System Event Log (SEL)
 getsensorinfo        -- display system sensors
 getssninfo           -- display session information
 getsvctag            -- display service tag information
 getsysinfo           -- display general RAC and system information
 gettracelog          -- display the RAC diagnostic trace log
 getuscversion        -- display the current USC version details
 getversion           -- display the current version details
 ifconfig             -- display network interface information
 inlettemphistory     -- inlet temperature history operations
 lclog                -- LCLog operations
 frontpanelerror      -- hide LCD errors - color amber to blue
 netstat              -- display routing table and network statistics
 ping                 -- send ICMP echo packets on the network
 ping6                -- send ICMP echo packets on the network
 racdump              -- display RAC diagnostic information
 racreset             -- perform a RAC reset operation
 racresetcfg          -- restore the RAC configuration to factory defaults
 remoteimage          -- make a remote ISO image available to the server
 serveraction         -- perform system power management operations
 set                  -- modify RAC configuration properties
 setled               -- Set the state of the LED on a module.
 setniccfg            -- modify network configuration properties
 sshpkauth            -- manage SSH PK authentication keys on the RAC
 sslcertdelete        -- delete an SSL certificate on the iDRAC
 sslcertview          -- view SSL certificate information
 sslcsrgen            -- generate a certificate CSR from the RAC
 sslresetcfg          -- resets the web certificate to default and restarts the web server.
 testemail            -- test RAC e-mail notifications
 testtrap             -- test RAC SNMP trap notifications
 testalert            -- test RAC SNMP - FQDN trap notifications
 traceroute           -- print the route packets trace to network host
 traceroute6          -- print the route packets trace to network host
 usercertview         -- view user certificate information
 vflashpartition      -- manage partitions on the vFlash SD card
 vflashsd             -- perform vFlash SD Card initialization
 vmdisconnect         -- disconnect Virtual Media connections
 vmkey                -- Deprecated: perform vFlash operations
 license              -- License Manager commands
 debug                -- Field Service Debug Authorization facility commands
 raid                 -- Monitoring and Inventory of H/W RAID connected to the server.
 hwinventory          -- Monitoring and Inventory of H/W NICs connected to the server.
 nicstatistics        -- Statistics for NICs connected to the server.
 fcstatistics         -- Statistics for FCs connected to the server.
 update               -- Platform Update of the devices on the server
 jobqueue             -- Jobqueue of of the jobs currently scheduled
 systemconfig         -- Backup &/or Restore of iDRAC Config and Firmware
 
 Groups
 
idRacInfo            -- Information about iDRAC being queried
cfgRemoteHosts       -- Properties for configuration of the SMTP server
cfgUserAdmin         -- Information about iDRAC users
cfgEmailAlert        -- Parameters to configure e-mail alerting capabilities
cfgSessionManagement -- Information of the session Properties
cfgSerial            -- Provides configuration parameters for the iDRAC 
cfgOobSnmp           -- Configuration of the SNMP agent and trap capabilities
cfgRacTuning         -- Configuration for various iDRAC properties.
ifcRacManagedNodeOs  -- Properties of the managed server OS
cfgRacSecurity       -- Configure SSL certificate signing request settings
cfgRacVirtual        -- Configuration Properties for iDRAC Virtual Media
cfgActiveDirectory   -- Configuration of the iDRAC Active Directory feature
cfgLDAP              -- Configuration properties for LDAP settings
cfgLdapRoleGroup     -- Configuration of role groups for LDAP
cfgLogging           -- Group Description for group cfgLogging
cfgStandardSchema    -- Configuration of AD standard schema settings
cfgIpmiSerial        -- Properties to configure the IPMI serial interface
cfgIpmiSol           -- Configuration the SOL capabilities of the system
cfgIpmiLan           -- Configuration the IPMI over LAN of the system
cfgIpmiPef           -- Configuration the platform event filters
cfgServerPower       -- Provides power management features
cfgServerPowerSupply -- Provides information related to the power supplies
cfgVFlashSD          -- Configure the properties for the vFlash SD card
cfgVFlashPartition   -- Configure partitions on the vFlash SD Card
cfgUserDomain        -- Configure the Active Directory user domain names
cfgSmartCard         -- Properties to access iDRAC using a smart card
cfgServerInfo        -- Configuration of first boot device
cfgSensorRedundancy  -- Configure the power supply redundancy
cfgLanNetworking     -- Parameters to configure the iDRAC NIC
cfgStaticLanNetworking -- Parameters to configure the iDRAC NIC
cfgNetTuning         -- Group Description for group cfgNetTuning
cfgIPv6LanNetworking -- Configuration of the IPv6 over LAN networking
cfgIPv6StaticLanNetworking -- Configuration of the IPv6 over LAN networking
cfgIPv6URL           -- Configuration of the iDRAC IPv6 URL.
 
For Help on configuring the properties of a group - racadm help config
 
-----------------------------------------------------------------------

Permalink: https://sudops.com/dell-12g-cpu-1-has-an-internal-error.html | 运维速度

Posted by u2 on 2014-05-23 under IT Vendors, Systems.