On Jan 16, a Linux server crashed and the Oracle database could not be opened after the server rebooted; the error said the data files were in use by another process.
This was surprising because we had restarted the server twice and still got the same error, so the files had to be locked on the NAS server, not in the OS itself. To bring the business application back, I had to restore the full database, but I still wanted to know what had locked these data files.
I wrote a document to explain this issue, copied below.
First, this issue can be reproduced; second, it can be fixed in a few minutes.
This note is a useful record of how to reproduce and fix the issue.
How to clear NFS locks left on Oracle data files after a crash or network outage
Below I'll list the detailed steps to reproduce this issue, and then the methods to resolve it.
1. When will this issue happen?
A crash is the most common cause: NFS locks are left behind on the NAS server when the client crashes. The stale locks are normally released when the server is powered back on, but some servers can never clear their own locks, which prevents Oracle from starting up.
2. What's the difference between the two kinds of servers?
To reproduce this issue, I downloaded the NetApp Simulator from the website and built a test environment, but the locks were freed every time the server rebooted, until I focused on this part of the note above:
Remove the NFS lock files on the host.
From TR-3183 - Using the Linux NFS Client with Network Appliance Storage:
rpc.statd runs the gethostbyname() method to determine the client's name, but lockd (in the Linux kernel) runs uname -n.
By changing the HOSTNAME= fully qualified domain name, lockd will use an FQDN when contacting the storage. If there is a lnx_node1.iop.eng.netapp.com and also a lnx_node5.ppe.iop.eng.netapp.com contacting the same NetApp storage, the storage will be able to correctly distinguish the locks owned by each client. Therefore, it is recommended to use the fully qualified name in /etc/sysconfig/network. In addition to this, running sm_mon -l or lock break on the storage will also clear the locks on the storage which will fix the lock recovery issue.
Additionally, if the client's nodename is fully qualified (that is, it contains the hostname and the domain name spelled out), then rpc.statd should also use a fully qualified name. Likewise, if the nodename is unqualified, then rpc.statd must use an unqualified name. If the two values do not match, lock recovery will not work. Ensure that the result of gethostbyname(3) matches the output of uname -n by adjusting your client's nodename in /etc/hosts, DNS, or the NIS databases.
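The mismatch described above can be checked from the client. This is a sketch of my own, not from TR-3183: it compares the kernel nodename (what lockd uses) with the name the resolver returns for it (a stand-in for rpc.statd's gethostbyname() call).

```shell
#!/bin/sh
# Compare the kernel nodename (used by lockd) with the name the
# resolver returns for it (used by rpc.statd). NLM lock recovery
# needs them to agree: both FQDN, or both short.
NODENAME=$(uname -n)
RESOLVED=$(getent hosts "$NODENAME" 2>/dev/null | awk '{print $2; exit}')
if [ -n "$RESOLVED" ] && [ "$NODENAME" = "$RESOLVED" ]; then
    echo "OK: nodename and resolved name match ($NODENAME)"
else
    echo "WARNING: nodename '$NODENAME' resolves to '$RESOLVED' - lock recovery may fail"
fi
```

If the script prints the warning, adjust /etc/hosts, DNS, or NIS (or the nodename itself) until the two names agree, as the TR-3183 passage above recommends.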
So I changed the hostname of my test server from localhost.localdomain to localhost, and the issue was finally reproduced.
The detailed steps are:
*Make sure the hostname of the test server is not an FQDN
*Mount the filesystem from the NAS server
*Start up the Oracle database; the locks will be placed on the NAS server
*Shut down the Oracle database; the locks will be released
*Start up the Oracle database again and check the lock status one more time
*Do not shut down the Oracle database; halt the system directly, and note the locks are freed as well
*Start up the server and the Oracle database again, and check the lock status
*Power off the test server directly; this is almost the same as a crash
*Check the lock status after several minutes; they are not cleared
*Start up the OS, then check the lock status (if the hostname is an FQDN, the locks are freed automatically at this point)
*Try to start up the Oracle database; you will get the errors below
*Check the Oracle alert log
So the issue is now reproduced.
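Some background on why a clean shutdown or reboot releases the locks while a crash with a mismatched hostname does not: on Linux, rpc.statd records each NFS peer it monitors as a file in its state directory (commonly /var/lib/nfs/statd/sm, though the path varies by distro) and sends SM_NOTIFY to those peers after a reboot; the storage then drops the locks whose owner name matches the notification, which is exactly where the short-name vs. FQDN mismatch breaks recovery. A self-contained sketch using a temporary directory in place of the real state directory; "nas01.example.com" is a made-up NAS hostname:

```shell
#!/bin/sh
# rpc.statd keeps one file per monitored NFS peer in its "sm"
# directory and notifies those peers after a reboot, which is what
# normally clears the stale locks on the storage. Simulated with a
# temp directory so this sketch is self-contained; the NAS hostname
# is hypothetical.
SM_DIR=$(mktemp -d)
touch "$SM_DIR/nas01.example.com"
echo "Peers statd would notify after reboot:"
ls "$SM_DIR"
rm -rf "$SM_DIR"
```

On a real client you can list the actual state directory to see which storage systems statd is tracking.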
3. How to fix this issue?
1. Do not crash the server!
2. Use an FQDN as the server hostname.
3. Clear the locks from the NAS server.
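For fix 2, the change on a RHEL-style system is in /etc/sysconfig/network (the hostname below is an example of mine, not from the original incident); reboot, or re-run hostname(1), so that uname -n reflects the new value:

```shell
# /etc/sysconfig/network  (RHEL/CentOS style)
NETWORKING=yes
HOSTNAME=dbserver1.example.com   # full FQDN, not just "dbserver1"
```

Remember the matching rule from TR-3183: the resolver must return the same form of the name, so update /etc/hosts, DNS, or NIS to agree with it.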
From the NetApp note above, we can use either of the two commands below on the storage to clear the locks:
sm_mon -l (for Data ONTAP versions earlier than 7.1; in practice it also works on later versions)
lock break -h [server] (for Data ONTAP versions later than 7.1)