CIFS VFS: No response to cmd

  • A+
Categories:Linux Oracle

I never thought mounting a shared filesystem maybe cause server reboot!

One node in a three-node RAC database was rebooted without any reason, and I was contacted to find the possible reason.

I checked the servers and found it was hard to confirm the root cause as I could not get the system information before the reboot, and had to get all the information from the system and database log and trace files.

From the /var/log/messages of the affected server, got below errors:

Jan 21 22:18:51 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49711
Jan 21 22:19:40 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49715
Jan 21 22:20:29 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49719
Jan 21 22:21:18 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49723
Jan 21 22:22:07 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49727
Jan 21 22:22:56 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49731
Jan 21 22:23:45 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49735
Jan 21 22:24:34 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49739
Jan 21 22:25:23 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49743
Jan 21 22:26:12 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49747
Jan 21 22:27:01 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49751
Jan 21 22:27:50 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49755
Jan 21 22:28:39 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49759
Jan 21 22:29:28 prod3 kernel:  CIFS VFS: No response for cmd 162 mid 49763
Jan 21 22:29:58 prod3 kernel: SysRq : Resetting
Jan 21 22:32:12 prod3 syslogd 1.4.1: restart.

So the server was restarted at 22:29:58, and before these errors I also found other errors:

Jan 21 21:48:49 prod3 kernel:  CIFS VFS: Write2 ret -28, wrote 0
Jan 21 21:49:20 prod3 last message repeated 21269 times
Jan 21 21:50:21 prod3 last message repeated 43405 times
Jan 21 21:51:22 prod3 last message repeated 42692 times
Jan 21 21:52:23 prod3 last message repeated 43849 times
Jan 21 21:53:24 prod3 last message repeated 44724 times
Jan 21 21:54:17 prod3 last message repeated 88025 times
Jan 21 21:55:02 prod3 kernel:  CIFS VFS: No response to cmd 47 mid 49592
Jan 21 21:55:02 prod3 kernel:  CIFS VFS: Write2 ret -11, wrote 0
Jan 21 21:55:02 prod3 kernel:  CIFS VFS: No response to cmd 47 mid 49591
Jan 21 21:55:02 prod3 kernel:  CIFS VFS: Write2 ret -11, wrote 0
Jan 21 21:55:02 prod3 kernel:  CIFS VFS: No response to cmd 47 mid 49593

The server mounted a large share folder from a Windows server for backup purpose.

From other two nodes, got the related information:

prod1:
Jan 21 22:30:30 prod1 kernel: o2net: connection to node prod3 (num 2) at 172.20.20.62:7777 has been idle for 30.0 seconds, shutting it down.
Jan 21 22:30:30 prod1 kernel: (swapper,0,16):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1516545000.229633 now 1516545030.229094 dr 1516545000.229617 adv 1516
545000.229647:1516545000.229648 func (49156591:504) 1516544986.229775:1516544986.229787)
Jan 21 22:30:30 prod1 kernel: o2net: no longer connected to node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -112 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -112
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -112 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -112
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res P000000000000000000000000000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M000000000000000023700c00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b100d00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M000000000000000023700c00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b100d00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,18):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res M0000000000000001c8c00b00000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7693,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res M00000000000000000b101100000000, error -107 send AST to node 2
Jan 21 22:30:30 prod1 kernel: (dlm_thread,7670,20):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,6):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,6):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res M000000000000000003622c00000000, error -107 send AST to node 2
Jan 21 22:30:38 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:00 prod1 kernel: (o2net,7350,23):o2net_connect_expired:1676 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7636,18):dlm_send_proxy_ast_msg:486 ERROR: 3ADDC1A2B0634122BC64ED298F64E0CD: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7636,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7670,8):dlm_send_proxy_ast_msg:486 ERROR: B9CD211F85F6429AAC04FB815C6EDC22: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7670,8):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7693,18):dlm_send_proxy_ast_msg:486 ERROR: 755B051EE3AD402784A78865F351F6F4: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7693,18):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,0):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7717,1):dlm_send_proxy_ast_msg:486 ERROR: 65D58ADDE3984ED79283A1DF83AEABD2: res S000000000000000000000200000000, error -107 send AST to node 2
Jan 21 22:31:03 prod1 kernel: (dlm_thread,7717,1):dlm_flush_asts:609 ERROR: status = -107
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2hb-755B051EE3,7684,12):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:06 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:07 prod1 kernel: (o2net,7350,17):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:32:39 prod1 kernel: o2net: accepted connection from node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:32:43 prod1 kernel: o2dlm: Node 2 joins domain 3ADDC1A2B0634122BC64ED298F64E0CD
Jan 21 22:32:43 prod1 kernel: o2dlm: Nodes in domain 3ADDC1A2B0634122BC64ED298F64E0CD: 0 1 2
Jan 21 22:32:48 prod1 kernel: o2dlm: Node 2 joins domain B9CD211F85F6429AAC04FB815C6EDC22
Jan 21 22:32:48 prod1 kernel: o2dlm: Nodes in domain B9CD211F85F6429AAC04FB815C6EDC22: 0 1 2
Jan 21 22:32:52 prod1 kernel: o2dlm: Node 2 joins domain 755B051EE3AD402784A78865F351F6F4
Jan 21 22:32:52 prod1 kernel: o2dlm: Nodes in domain 755B051EE3AD402784A78865F351F6F4: 0 1 2
Jan 21 22:32:56 prod1 kernel: o2dlm: Node 2 joins domain 65D58ADDE3984ED79283A1DF83AEABD2
Jan 21 22:32:56 prod1 kernel: o2dlm: Nodes in domain 65D58ADDE3984ED79283A1DF83AEABD2: 0 1 2

prod2:
Jan 21 22:30:29 prod2 kernel: o2net: connection to node prod3 (num 2) at 172.20.20.62:7777 has been idle for 30.0 seconds, shutting it down.
Jan 21 22:30:29 prod2 kernel: (swapper,0,16):o2net_idle_timer:1515 here are some times that might help debug the situation: (tmr 1516544999.275659 now 1516545029.274388 dr 1516544999.275645 adv 1516
544999.275670:1516544999.275671 func (70a70579:505) 1516544815.278433:1516544815.278436)
Jan 21 22:30:29 prod2 kernel: o2net: no longer connected to node prod3 (num 2) at 172.20.20.62:7777
Jan 21 22:30:59 prod2 kernel: (o2net,7207,11):o2net_connect_expired:1676 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,2): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,1): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,0): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (o2hb-755B051EE3,7503,0):ocfs2_dlm_eviction_cb:98 device (253,3): dlm has evicted node 2
Jan 21 22:31:03 prod2 kernel: (ocfs2rec,1910,0):ocfs2_replay_journal:1182 Recovering node 2 from slot 1 on device (253,2)

And I found possible reason from below two notes:

CIFS Mount Soft Lockup While Transferring Large Files (Doc ID 1621616.1)

Hung CIFS processes in D state and stuck in "cifs_reopen_file" Also the logs show "kernel: CIFS VFS: No writable handles for inode" before server getting rebooted (Doc ID 2318108.1)

Yes, sometimes the CIFS sharing even could cause server reboot.

From these notes, below methods could be taken:

echo 0 > /proc/fs/cifs/OplockEnabled
enable 'forcedirectio' option of the mount command

From the alert log of the database, I did not find useful information:

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.4.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 21-JAN-2018 22:22:45
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
Sun Jan 21 22:34:26 2018
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0

In the trace files of the cluster, I could only got the system worked abnormally at least from 22:24, and was evicted finally.

[/u01/app/11.2.0/grid/bin/oraagent.bin(8815)]CRS-5818:Aborted command 'check' for resource 'ora.prod.db'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent
_oracle/oraagent_oracle.log.
2018-01-21 22:24:17.869:
[/u01/app/11.2.0/grid/bin/oraagent.bin(8815)]CRS-5832:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' was unable to process commands. Details at (:CRSAGF00128:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2018-01-21 22:24:43.241:
[/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5818:Aborted command 'check' for resource 'ora.ons'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_gri
d/oraagent_grid.log.
2018-01-21 22:24:43.400:
[/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/app/11.2.0/grid/opmn/bin/onsctli" for action "check": details at
"(:CLSN00009:)" in "/u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log"
2018-01-21 22:24:43.420:
[/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/agent/crsd/or
aagent_grid/oraagent_grid.log.
2018-01-21 22:24:43.515:
[/u01/app/11.2.0/grid/bin/oraagent.bin(8717)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/app/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CL
SN00009:)" in "/u01/app/11.2.0/grid/log/prod3/agent/crsd/oraagent_grid/oraagent_grid.log"
2018-01-21 22:27:17.948:
[crsd(8528)]CRS-5825:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' is unresponsive and will be restarted. Details at (:CRSAGF00131:) {3:44246:2} in /u01/app/11.2.0/grid/log/prod3/crsd/crsd.log.
2018-01-21 22:28:59.355:
[cssd(8447)]CRS-1662:Member kill requested by node prod1 for member number 2, group DBprod
2018-01-21 22:29:00.073:
[cssd(8447)]CRS-1662:Member kill requested by node prod2 for member number 2, group DBprod
2018-01-21 22:29:30.225:
[cssd(8447)]CRS-1608:This node was evicted by node 1, prod1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:30.235:
[cssd(8447)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log
2018-01-21 22:29:30.235:
[cssd(8447)]CRS-1652:Starting clean up of CRSD resources.
2018-01-21 22:29:30.846:
[cssd(8447)]CRS-1604:CSSD voting file is offline: /oraocr/storage/vdsk; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:34.079:
[cssd(8447)]CRS-1605:CSSD voting file is online: /oraocr/storage/vdsk; details in /u01/app/11.2.0/grid/log/prod3/cssd/ocssd.log.
2018-01-21 22:29:45.238:
[cssd(8447)]CRS-1653:The clean up of the CRSD resources failed.
2018-01-21 22:29:48.309:
[ohasd(8031)]CRS-2765:Resource 'ora.ctssd' has failed on server 'prod3'.
2018-01-21 22:29:48.554:
[ohasd(8031)]CRS-2765:Resource 'ora.evmd' has failed on server 'prod3'.
2018-01-21 22:29:50.093:
[ctssd(25158)]CRS-2402:The Cluster Time Synchronization Service aborted on host prod3. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/prod3/ctssd/octssd.log.
2018-01-21 22:29:54.187:
[ohasd(8031)]CRS-2878:Failed to restart resource 'ora.ctssd'
2018-01-21 22:29:54.202:
[ohasd(8031)]CRS-2769:Unable to failover resource 'ora.ctssd'.
2018-01-21 22:29:55.138:
[ctssd(25172)]CRS-2402:The Cluster Time Synchronization Service aborted on host prod3. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/prod3/ctssd/octssd.log.
2018-01-21 22:32:58.157:
[ohasd(8030)]CRS-2112:The OLR service started on node prod3.
2018-01-21 22:32:58.501:
[ohasd(8030)]CRS-1301:Oracle High Availability Service started on node prod3.

I guess the system maybe lost response from 22:24 while I could not confirm this point.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: