netapp fas3020 filer time skew issue…

problem: linux servers using nfs shares on a fas3020 filer create files that are not immediately timestamped and end up with creation times which are in the future when they are finally given a time stamp.

here are the details of the issue (with some real output from our filer), which dogged us for about a month while we worked to resolve it.  our fas3020 filer, which serves nfs shares to linux (rhel5) servers, would not put the proper time stamps on newly created files.  for some reason the time stamp was only applied after more than 60 seconds; until then, the file would simply show a month/year combination.  the output looked identical to what you see here:

[root]:[1193]> touch x  

(we would create a file named “x”)

[root]:[1194]> date 

(we would run the date command to see what the current system time was – even though it is in my bash shell prompt…just making sure!)

Mon Apr 13 14:30:40 EDT 2009
[root]:[1196] > ls -lai --full-time 

(we would then run the ls command using the --full-time option, which turned out to be a pretty cool option for seeing the time stamp on a file)

total 24
64 drwxrwxrwx   5 root     root     4096 2009-04-13 14:31:18.800445000 -0400 .
2 drwxr-xr-x  27 root     root     4096 2009-03-27 08:31:37.000000000 -0400 ..
101 drwxr-xr-x   4 root     root     4096 2009-03-27 09:20:03.072649000 -0400 BACKUPS
67 drwxrwxrwx  13 root     root     4096 2009-04-13 12:00:19.162472000 -0400 .snapshot
7881111 drwxrwxrwx   3 sanadmin sanadmin 4096 2009-04-02 09:38:39.717613000 -0400 UPGRADES
7881116 -rw-r--r--   1 root     root        0 2009-04-13 14:31:18.801444000 -0400 x 

(as you can see here, the file “x”, which was created at 14:30, is time stamped in the future at 14:31:18 and not at the time it was created!)
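if you would rather not eyeball ls output, the same check can be scripted.  what follows is only a minimal sketch, not something we ran at the time: the function name measure_timestamp_skew and the mount point in the comment are made up for illustration.  it creates a scratch file in the directory you give it, reads the mtime back, and compares it to the client's clock.  keep in mind that nfs attribute caching on the client can affect what stat reports, so treat the number as a hint.

```python
import os
import tempfile
import time


def measure_timestamp_skew(directory):
    """Create a scratch file in `directory`, then return its mtime minus the
    local clock reading taken just before creation, in seconds.

    A large positive value means the file is stamped in the client's future.
    """
    before = time.time()
    fd, path = tempfile.mkstemp(dir=directory, prefix="skewtest_")
    try:
        os.close(fd)
        mtime = os.stat(path).st_mtime
    finally:
        # Always clean up the scratch file, even if stat fails.
        os.remove(path)
    return mtime - before


# Example call (hypothetical mount point, adjust to your own nfs share):
# print("%+.3f seconds" % measure_timestamp_skew("/mnt/filer_share"))
```

on a healthy share the result should be close to zero; in our case we would have expected something over 60 seconds.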

a quick aside: there is a perl command that does much the same thing as the --full-time option to ls (though clearly not as short and sweet).  i would like to give full credit to the original author, or at least to the individual who was kind enough to post this information.  the command is:

perl -e '@d=localtime((stat(shift))[9]); printf "%02d-%02d-%04d %02d:%02d:%02d\n", $d[3],$d[4]+1,$d[5]+1900,$d[2],$d[1],$d[0]' <file name>

we were able to replicate this issue at will and noticed that the time stamp was always in the future and always by more than 60 seconds.  we contacted our netapp se, who suggested that since we were running ontap 7.2.5.1 we should take the opportunity to upgrade to ontap 7.2.6.1, which would also rule out some sort of code bug.  during the upgrade process, while we were waiting for the cf takeover command to complete, we were lucky enough to catch the following output as it scrolled across the screen:

nas1> Tue Jun  9 17:06:09 UTC [nas1: kern.timed.toolarge:warning]: server 'xxx.xxx.xxx.xxx'
reports the date is Tue Jun  9 17:05:01 UTC 2009 (appliance is fast by 68.806 seconds).
The date offset is too large to change automatically (timed.max_skew: 60.000 seconds).
If the date reported by the date server is correct, set the filer date using the 'date' command.
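the two numbers that matter in that message are the reported offset (68.806 seconds) and the limit (60.000 seconds).  if you collect filer logs somewhere, a small parser can pull them out for you.  this is a rough sketch based only on the single message shown above; the function name parse_toolarge and the regular expressions are my own, and other ontap releases may word the message differently.

```python
import re

# The message as captured above, joined onto one line.
LOG = (
    "Tue Jun  9 17:06:09 UTC [nas1: kern.timed.toolarge:warning]: "
    "server 'xxx.xxx.xxx.xxx' reports the date is Tue Jun  9 17:05:01 UTC 2009 "
    "(appliance is fast by 68.806 seconds). "
    "The date offset is too large to change automatically "
    "(timed.max_skew: 60.000 seconds)."
)


def parse_toolarge(message):
    """Return (direction, offset_seconds, max_skew_seconds), or None if the
    message does not look like the kern.timed.toolarge warning shown above."""
    offset = re.search(r"appliance is (\w+) by ([\d.]+) seconds", message)
    limit = re.search(r"timed\.max_skew: ([\d.]+) seconds", message)
    if offset is None or limit is None:
        return None
    return offset.group(1), float(offset.group(2)), float(limit.group(1))


direction, offset_seconds, max_skew_seconds = parse_toolarge(LOG)
# The offset exceeds the limit, which is why the filer refused to adjust.
offset_exceeds_limit = offset_seconds > max_skew_seconds
```

the offset also matches the time stamps themselves: 17:06:09 on the filer versus 17:05:01 on the time server is 68 seconds.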

at this point, we knew we had solved the issue.  upon logging into both of our filer heads, we immediately noticed that our options setting timed.max_skew was, in fact, set to 60 seconds.  here is what the output for this setting looks like when you are logged on to the filer at the command prompt and type the command “options” (which lists out all your option settings):

timed.max_skew               1m         (same value in local+partner recommended)

it is also important, as the output displays, to ensure that you have the same setting on the partner filer if you are in a dual-controller/failover configuration.  what this setting meant was that if the filer head time strayed more than 60 seconds from the “official” ntp server time, the filer heads would simply not update their time.  to fix the issue, you can either run the date command on the filer or do what we did, which was to simply change the timed.max_skew setting to 120 seconds.  both will work.  the command to set timed.max_skew to 2 minutes is the following:

option timed.max_skew 120

now running the “options” command displays what you would expect:

timed.max_skew               2m         (same value in local+partner recommended)
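to make the logic concrete, here is a toy model of the decision the filer appears to make.  it is only an illustration of the behavior we observed, not ontap's actual code: the function names are made up, the parser only understands the two value forms seen in this post (a bare number of seconds such as 120, and a minutes form such as 1m or 2m), and whether an offset exactly equal to the limit is accepted is an assumption.

```python
def to_seconds(value):
    """Convert a max_skew value as seen in this post to seconds.

    Only two forms are handled: a bare number of seconds ("120") and a
    number of minutes with an "m" suffix ("1m", "2m").
    """
    value = value.strip()
    if value.endswith("m"):
        return float(value[:-1]) * 60.0
    return float(value)


def will_auto_correct(offset_seconds, max_skew):
    """Toy model: the clock is adjusted automatically only when the offset
    from the time server is within max_skew (boundary case is an assumption)."""
    return abs(offset_seconds) <= to_seconds(max_skew)


# Our filer was fast by 68.806 seconds.
refused_at_1m = not will_auto_correct(68.806, "1m")   # over the 60 second limit
accepted_at_2m = will_auto_correct(68.806, "2m")      # within the 120 second limit
```

with the limit at 1m the 68.806 second offset is rejected, and with the limit at 2m it is accepted, which is what we saw at the next ntp poll.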

after an hour (the setting we have for the filers to poll the ntp servers) the times were put back in sync and the issue of the time stamp being off has completely disappeared.  during the process a co-worker of mine ran across a post at the following web link:

netapp communities link to post

this post was written by a gentleman by the name of tim abbott, and when we found it we were excited, hoping that a fix/solution had been posted in response, but no such luck.  we posted our output there too in the hopes that someone would chime in.  i have also posted a link to this page in response to the final post in that thread and who knows, maybe some day someone else will find this and look like a rock-star…

as to why or how the time initially got to be more than 60 seconds off, the only thing we can surmise is that when the system was initially installed we didn’t do the ntp setup right away, and the time that was set with the date command was off by about a minute.  then the system wasn’t used for at least 4 to 5 months, and by the time one of our programmers found the problem and reported it to us, well, it was all a distant memory.

devnull