Saw some interesting behavior recently in Windows Server 2012 Clustering.
We noticed that we were unable to perform any live migrations on one of our clusters.
Looking in Cluster Events we saw a lot of errors - Event Id 1196 coming up regularly, every 10-15 minutes or so. We also saw errors for Event Id 1206 when restarting the Cluster Name Resource.
Event Id 1196 -
Cluster network name resource 'Cluster Name' failed registration of one or more associated DNS name(s) for the following reason:
The handle is invalid.
Event Id 1206 -
The computer object associated with the cluster network name resource 'Cluster Name' could not be updated in domain 'xxxxx.xxxx'. The error code was 'Password change'. The cluster identity 'xxxxxxx$' may lack permissions required to update the object. Please work with your domain administrator to ensure that the cluster identity can update computer objects in the domain.
The cluster logs also showed some interesting errors -
00000ef0.00002cb0::2013/06/17-15:13:28.905 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user xxxxxxx$: 1326 (useSecondaryPassword: 0)
00000ef0.00002cb0::2013/06/17-15:13:28.961 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user xxxxxxx$: 1326 (useSecondaryPassword: 1)
00000ef0.00002cb0::2013/06/17-15:13:28.961 INFO [RES] Network Name: [NNLIB] Logon failed for user megadrive$ (Error 1326), DC \xxxxxx.xxxx.xxxx, domain campus.bath.ac.uk
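Win32 error 1326 is ERROR_LOGON_FAILURE (the user name or password is incorrect), which fits the 'Password change' error in Event Id 1206. If you have a long cluster log to wade through, a rough Python sketch like the one below can pull out these failures. The regular expression is only based on the three lines quoted above, and the function name is made up for illustration, so treat it as a starting point rather than a proper log parser.

```python
import re

# Pattern inferred from the "LogonUserEx fails" warnings quoted above;
# other cluster log versions may format this line differently.
LOGON_FAIL = re.compile(
    r"LogonUserEx fails for user (?P<user>\S+): (?P<code>\d+) "
    r"\(useSecondaryPassword: (?P<secondary>[01])\)"
)

# Win32 error 1326 is ERROR_LOGON_FAILURE (bad user name or password).
ERROR_LOGON_FAILURE = 1326


def find_logon_failures(lines):
    """Return (user, error_code, used_secondary_password) per matching line."""
    failures = []
    for line in lines:
        match = LOGON_FAIL.search(line)
        if match:
            failures.append(
                (
                    match.group("user"),
                    int(match.group("code")),
                    match.group("secondary") == "1",
                )
            )
    return failures


# The three log lines from this post (raw strings, because of the backslash).
sample = [
    r"00000ef0.00002cb0::2013/06/17-15:13:28.905 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user xxxxxxx$: 1326 (useSecondaryPassword: 0)",
    r"00000ef0.00002cb0::2013/06/17-15:13:28.961 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user xxxxxxx$: 1326 (useSecondaryPassword: 1)",
    r"00000ef0.00002cb0::2013/06/17-15:13:28.961 INFO [RES] Network Name: [NNLIB] Logon failed for user megadrive$ (Error 1326), DC \xxxxxx.xxxx.xxxx, domain campus.bath.ac.uk",
]

for user, code, secondary in find_logon_failures(sample):
    print(user, code, "secondary password" if secondary else "primary password")
```

Both the primary and the secondary password failing is the interesting part: the cluster no longer holds a password that Active Directory accepts for the CNO.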
"Binging" around - really, do people Bing things? - "Googling" around (do you think the most Binged thing is "How do I set Google as my default search engine in IE?") we came across lots of interesting tales about how people had rebuilt their cluster, or reverted to Windows 2008 R2 because the problems weren't seen there.
And then we stumbled across a blog post by the Windows Server Core Team - http://blogs.technet.com/b/askcore/archive/2012/09/25/cno-blog-series-increasing-awareness-around-the-cluster-name-object-cno.aspx - about how sysadmins need more 'awareness' of the Cluster Name Object (CNO). When a cluster is created, the computer object for the cluster is created in the Computers container. If it is moved to another OU then "the non-default location may not have the rights it needs for other cluster operations". We had moved ours to a different OU. We also saw a number of other articles describing problems similar to ours but not the same - http://www.andrewparisio.com/2012/12/windows-failover-cluster-live-migration.html
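If you want a quick way to tell whether a CNO still lives in the default location, you can check its distinguished name. This is a minimal Python sketch, assuming you have already looked up the DN with your usual AD tooling. The function names and the example DNs are hypothetical, and the split is naive (it assumes the object's own name contains no escaped commas).

```python
def parent_container(distinguished_name):
    """Return everything after the first RDN, i.e. the parent container's DN."""
    # Naive split: assumes the first RDN contains no escaped commas.
    _, _, parent = distinguished_name.partition(",")
    return parent


def in_default_computers_container(distinguished_name):
    """True if the object sits directly in the default Computers container."""
    return parent_container(distinguished_name).upper().startswith("CN=COMPUTERS,")


# Hypothetical DNs, for illustration only.
default_dn = "CN=CLUSTER1,CN=Computers,DC=example,DC=com"
moved_dn = "CN=CLUSTER1,OU=Clusters,OU=Servers,DC=example,DC=com"

print(in_default_computers_container(default_dn))
print(in_default_computers_container(moved_dn))
```

A CNO outside the Computers container is not automatically broken, but as the quote above says, the new location may not carry the rights the cluster needs.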
This was our fix -
1 - Move CNO back to the Computers Container
2 - Give the Cluster Node Computer Accounts Change Password permission on the CNO
3 - Take the Cluster Name Resource offline
4 - Repair Cluster Name Resource
5 - Bring Cluster Name Resource back online
Job done, we can now Live Migrate.
Hope this helps anyone in a similar situation.