Netapp

Don’t Forget to Assign Ownership to Disks on your Netapp Filers

The other day a co-worker was doing his rounds in the data center and sent the team an email (with photos attached) pointing out amber lights on one of the shelves of our Netapp Filer. Amber lights on a disk shelf on your SAN usually aren’t a good thing, so I jumped onto OnCommand System Manager to begin troubleshooting. It didn’t take me long to figure out the problem, because the first thing I saw after logging into OnCommand was a warning that I had “unowned disks”.

The aha moment hit me: just the day before we had 2 disks fail, and the replacements had arrived via the configured Autosupport. Gosh, I do like Autosupport. Having a disk go bad overnight and then getting a replacement before you even walk through the door is very convenient.

When these particular disks were replaced by another co-worker the day before, he had removed the bad disks and inserted the replacement disks. Since our filers are set up with software disk ownership, with disk auto assign disabled, the disks were not assigned automatically to an owner. Ownership must be assigned to a disk before the filer can use it; otherwise the disk is useless and flashes amber lights at you.

To assign ownership to the disk(s), SSH into your Filer and do the following:

  1. Locate the disk(s) that don’t have an owner by typing the following command:
    disk show -n
  2. Once you have the name of the unowned disk, assign ownership with this command:
    disk assign <disk_name>
    For example, for a single disk:
    disk assign 0b.16
    For multiple disks:
    disk assign 0b.43 0b.24 0b.27
    Or, to assign ownership of all unowned disks at once:
    disk assign all
  3. Run disk show -v to verify the disk assignments.
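If you have more than a couple of disks to deal with, the steps above can be sketched as a small Python helper that pulls the disk names out of pasted `disk show -n` output and builds a single `disk assign` command. This is only a sketch: the sample output, its column layout, and the function names are my own assumptions for illustration, and the pattern only recognizes names shaped like 0b.16.

```python
import re

# Assumption: disk names look like "0b.16" (adapter, a dot, then the disk ID),
# matching the names used in the commands above.
DISK_NAME = re.compile(r"^\d+[a-z]\.\d+$")


def unowned_disks(disk_show_output: str) -> list[str]:
    """Return disk names found in the first column of `disk show -n` output."""
    disks = []
    for line in disk_show_output.splitlines():
        fields = line.split()
        if fields and DISK_NAME.match(fields[0]):
            disks.append(fields[0])
    return disks


def assign_command(disks: list[str]) -> str:
    """Build one `disk assign` command covering every disk in the list."""
    if not disks:
        raise ValueError("no unowned disks to assign")
    return "disk assign " + " ".join(disks)


# Illustrative output only; the real column layout may differ.
SAMPLE = """\
  DISK       OWNER            POOL   SERIAL NUMBER
------------ -------------    -----  -------------
0b.43        Not Owned        NONE   SERIAL0001
0b.24        Not Owned        NONE   SERIAL0002
0b.27        Not Owned        NONE   SERIAL0003
"""

if __name__ == "__main__":
    print(assign_command(unowned_disks(SAMPLE)))
```

Building the command explicitly, instead of reaching for disk assign all, lets you read the list before anything is assigned.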

So there you go, a pretty simple fix. Next time you replace a failed drive don’t forget to give it an owner!


SnapManager for Exchange fails to run scheduled snaps after running an upgrade to 6.0.4

Sometimes fixes & patches introduce another set of issues that will give way to another set of new patches and fixes.

In our case, it was our upgrade to SnapManager for Exchange (SME) 6.0.4, which had fixes for some bugs we were facing. Everything seemed to go really well; the upgrades on the Exchange 2010 DAG member servers didn’t hiccup one bit. This was too good to be true: an upgrade of SME and no issues so far. I had my fingers crossed and was hoping for the best; maybe luck would be in our corner.

No Joy…

After completing the upgrade on all servers I needed to run a test of some Exchange snaps. Got to make sure it works, right? I started out running manual snaps on all the databases on each node. Those worked great, no problems.

So onward to the next test, which was to kick off a scheduled snap of the DAG databases. After kicking off a scheduled snap through Task Scheduler, the snaps failed to run. After some digging around and a few more tests, my co-worker discovered that there is a bug when you upgrade to SME 6.0.4 which causes scheduled snaps to fail.

According to Netapp’s KB article 649767, the cause is that the value “0” is not selectable for the “retain up-to-the-minute restorability” option in the GUI of this release, as it was in previous releases. When running snaps through the GUI of SME 6.0.4, you can manually enter the value “0” and run the job immediately, and backups will work. The issue occurs when SME creates a scheduled job: it creates the job with the wrong parameter. It should be NoUtmRestore if you don’t want to retain any transaction logs.

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=649767


Getting Backups to work again…

To get scheduled backups to work again you will need to do one of 2 things:

  • Change -RetainUTMDays and -RetainUTMBackups to something other than “0”. Your transaction logs will then be retained for the specified value.
  • If you don’t want to keep any transaction logs, manually modify the scheduled job: remove the -RetainUTMDays or -RetainUTMBackups parameter and replace it with NoUtmRestore.
    • If you are running a DAG, remember that you will need to modify the scheduled job on every DAG member that has it.
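With several DAG members to touch, it can help to rewrite the job's command line the same way every time. Here is a minimal Python sketch of the second option. It assumes the switch is written as -NoUtmRestore (with a leading dash like the other parameters) and that the value appears as a bare 0; the command line in the usage example is made up for illustration and is not copied from a real scheduled job.

```python
import re

# Matches " -RetainUTMDays 0" or " -RetainUTMBackups 0", case-insensitively.
# Non-zero values are deliberately left alone, since those still work.
UTM_ZERO = re.compile(r"\s+-RetainUTM(?:Days|Backups)\s+0(?=\s|$)", re.IGNORECASE)


def fix_scheduled_command(command: str) -> str:
    """Swap a zero-valued UTM retention parameter for the NoUtmRestore switch."""
    fixed, replaced = UTM_ZERO.subn("", command)
    # Assumption: the switch is spelled -NoUtmRestore on the command line.
    if replaced and "-noutmrestore" not in fixed.lower():
        fixed += " -NoUtmRestore"
    return fixed


if __name__ == "__main__":
    # Hypothetical command line, for illustration only.
    before = "new-backup -Server EX01 -RetainBackups 8 -RetainUTMBackups 0 -Verify"
    print(fix_scheduled_command(before))
```

Paste the result back into the scheduled task on each DAG member, and compare it against the original before saving.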


Unable to resize a Netapp volume?

There may be times when you need or want to re-size a Netapp volume, and normally this process is very easy. You can re-size a Netapp volume using the OnCommand tool, FilerView, or by connecting to the filer directly over SSH. Any of these ways is perfectly fine; there is no wrong way to do it, except when it fails.

A common cause I have seen for failed volume re-sizing is the “fs_size_fixed” error.


“fs_size_fixed” is an option that has been enabled on the volume, either during setup or during a snapmirror relationship break. It is there to prevent any accidental re-sizing of the volume. The only way to re-size the volume is to turn the “fs_size_fixed” option off by connecting directly to the filer through an SSH tool and running the following commands. Once the option is off you will be able to re-size the volume.

1. Connect to the filer (you can use Putty if you have it).
2. First verify that “fs_size_fixed” is enabled on the vol. Type: vol status [name of vol]

You will see in the status that “fs_size_fixed” is set to ON.

3. At the prompt type: vol options [name of vol] fs_size_fixed off

4. You can confirm that the option has been disabled by typing: vol status [name of vol]

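If you capture the vol status output, the check in step 2 can be sketched in a few lines of Python. The sample output below, and the assumption that the option appears as fs_size_fixed=on in the options column, are mine for illustration; compare them with what your own filer prints.

```python
import re

# Assumption: the option is listed as "fs_size_fixed=on" in the vol status output.
FIXED_ON = re.compile(r"\bfs_size_fixed\s*=\s*on\b", re.IGNORECASE)


def size_is_fixed(vol_status_output: str) -> bool:
    """Return True if the pasted `vol status` output shows fs_size_fixed enabled."""
    return FIXED_ON.search(vol_status_output) is not None


def unlock_command(volume: str) -> str:
    """Build the command from step 3 for the given volume name."""
    return f"vol options {volume} fs_size_fixed off"


# Illustrative output only; the real layout may differ.
SAMPLE = """\
         Volume State           Status            Options
           vol1 online          raid_dp, flex     nosnap=on, fs_size_fixed=on
"""

if __name__ == "__main__":
    if size_is_fixed(SAMPLE):
        print(unlock_command("vol1"))
```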

Unable to Resize iSCSI LUN Using SnapDrive on Windows Server 2003 R2

I recently had another one of my weird Snapdrive issues while trying to resize an iSCSI LUN on a 2003 server. The server is a VM that uses the Microsoft iSCSI initiator and Snapdrive to manage the Netapp provisioned LUN. Re-sizing a LUN using Snapdrive is normally very simple, but of course on this particular day it was not behaving for me.

Snapdrive appeared to be running ok and didn’t seem to have any issues at all that day. The problem came when I attempted to re-size the LUN: the Snapdrive re-sizing process would fail halfway through. The failure left me puzzled, since all connections to the filer appeared to be fine. There was plenty of space left on the volume, so it wasn’t a space issue.

Since we were dealing with Windows here, we rebooted the server just in case it was pending a reboot or just needed to “clear its head”. After the reboot I attempted to re-size the LUN again, and again it failed. The actual failure message was that it was unable to connect to the disk. Odd… it’s connected in Snapdrive, it just won’t resize.

The next thing I thought of was to force a disconnect of the iSCSI LUN, which would drop all connections to it. The downside was that the LUN would be unavailable and the SQL databases would need to be stopped. After getting approval to take the server down again, I proceeded to force a disconnect of the LUN. Once all connections were stopped and confirmed gone, I reconnected the iSCSI LUN using Snapdrive.

After the re-connection was completed, I tried re-sizing the LUN again. BAM! It worked. All it took was a force disconnect and a reconnect, and then I could re-size. To be honest, I wasn’t in the mood to dig further into a root cause for the failure, especially since I had it working. I suspect it had something to do with Snapdrive and the iSCSI connection it was using, since a brand new connection seemed to clear whatever issue it had. So, if you run into something like this, it might be worth a force disconnect to solve your re-sizing problem.

Failed Login to Netapp Filer using SSH/Putty

Netapp filers can be accessed and managed in many ways, including using Putty to SSH into the filer itself. In addition to FilerView, there is another web-based tool called Netapp OnCommand System Manager, a GUI that gives a very nice graphical performance chart detailing how HOT your filers are running. The OnCommand tool is great for everyday management of the filers, but sometimes you will need to access the filers via Putty to run more advanced functions, e.g. killing an NDMP session that is hung.

We had an interesting issue today while trying to access one of our Netapp filers using Putty. Every time we tried to log into the filer with a Putty session we would get an access denied, or the Putty session would simply close. What was odd was that it didn’t happen for all of us Storage Engineers. Thinking that maybe our accounts were locked or maybe the access had been removed, I started an OnCommand session and attempted to log into the filers.

Not a single hiccup. Logged in right away on every single filer we have. Hmmm… so I can log in with my credentials using OnCommand, but not with a Putty session. Yet another storage engineer can log in to both, and we all have the same permissions. All filers, and Active Directory, were checked for locked accounts; nothing was locked.

After some more head scratching, one of the other Storage Engineers stumbled upon a setting within OnCommand System Manager that was caching our passwords. Once the tick box to cache passwords was cleared, we were able to successfully log onto the filers.

To remove the cached passwords in OnCommand:

  1. Run OnCommand System Manager and log onto any filer
  2. In the top left hand corner select Tools
  3. Select Options
  4. Clear the Enable Cache Passwords tick box
  5. Select Clear Existing Passwords
  6. Select Save and Close

Once the settings were changed we were both able to Putty to the filers. Gotta love the gotchas of cached passwords.

Snapdrive services failing to start on Windows Server 2008 x64

Snapdrive for Windows is Netapp’s storage management software that allows you to easily provision storage and to back up and restore your data on a Windows server. It’s a great tool when it works, but when it doesn’t it’s a bear. I recently had the experience of troubleshooting some of our servers that had Snapdrive issues connecting to our filer. The servers’ iSCSI connections were not affected, so the issue went unnoticed for some time until a request to expand LUNs was made… That’s when it was discovered that the Snapdrive service was not running and was failing to start.

When Snapdrive was opened the MMC would crash, which resulted in the following error in the Snapdrive MMC:

Web Service Client Channel was unable to connect to the LUNProvisioningService instance on machine ServerName.
Could not connect to ‘net.tcp://ServerNameSnapDrive/LUNProvisioningService.’ The connection attempt lasted for a time span of 00:00:00. TCP error code 10061: No connection could be made because the target machine actively refused it 

The event that appeared in the application logs:

Description:
Log Name: Application

Source: SnapDrive
Date: 1/05/2013 10:41:33 AM
Event ID: 101
Task Category: Generic event
Level: Error
Keywords: Classic
User: N/A
Computer: myserverxxx.com
Description:
SnapDrive service failed to start.
Error code : SnapDrive Web Service failed to start Reason: ‘The TransportManager failed to listen on the supplied URI using the NetTcpPortSharing service: failed to start the service. Refer to the Event Log for more details.’

I immediately jumped onto Netapp’s support site and started searching for known issues. One post said to check the permissions of the account accessing the filer and make sure it had local admin rights on the server. I knew that wasn’t the issue, because the account already had local admin rights. Plus, Snapdrive had been working up until recently, so permissions would be at the bottom of the list of culprits. The next few hits on the forums indicated that IIS Admin needed to be enabled, and that the Net.Tcp Port Sharing service needed to be enabled too. When I checked the services, IIS Admin wasn’t even installed and the Net.Tcp Port Sharing service was in a disabled state. I attempted to re-enable the service but it failed, as I expected it to. Odd, I thought. Where is the IIS Admin service? What would prevent these services from starting?

Since IIS Admin wasn’t available, I went to Server Manager, confirmed it wasn’t installed, and installed the feature from there. After the installation was completed I attempted to start the Net.Tcp Port Sharing service and the Snapdrive services again, but all of them failed. Back to scratching my head again.

It took some digging, but eventually I came to Netapp KB2013168. The article noted the following: “.NetFramework and the Net.Tcp PortSharing Service. If .Net is not properly installed or the Net.Tcp PortSharing Service service are not functioning correctly, SnapDrive will not be able to connect to the LUNProvisioningServices and the ability to manage LUNs via the MMC can be impaired.”

Oh Snap! Anybody that knows me in “real” life knows how much the word .Net just gets under my skin. I’ve had to deal with so many issues that involved corrupted installs of .Net or some sort of Microsoft patch that would  “break” .Net and the application that depended on it, that I’ve grown a hatred for the word .Net.

Now that I had something to go on, I followed the steps in the KB article for issue #2 and issue #3 (the symptoms I was experiencing):

Issue 2:
Directory permissions to C:\WINDOWS\Microsoft.NET\Framework\v3.0\Windows Communication Foundation\SMSvcHost.exe.
For the NT Authority\Local Service account to be able to start this service, users must have read and execute permissions to the above path.

Resolution to Issue 2:
Incorrect permissions were configured on the C:\windows directory.
Verify that users have read and execute permissions to the path C:\WINDOWS\Microsoft.NET\Framework\v3.0\Windows Communication Foundation\SMSvcHost.exe.

Well, permissions weren’t it, because everything was there. Now on to issue #3.

Issue 3:
SnapDrive 6.x service did not start because the ‘Net.Tcp Port Sharing service’ will not stay started. This is a dependency SnapDrive 6.x has that earlier versions do not.

Resolution to Issue 3:

Reinstall Microsoft .Net.

Reinstall .Net? Great, this should be fun, I thought to myself. I confirmed via Add/Remove Programs that .Net 3.5 was installed, but the document stated that Snapdrive required .Net 3.0 SP1, and that particular version was not listed anywhere. On a hunch, I went to Server Manager > Features to see if the .Net 3.0 Framework features were installed, and yes, they were! Using the Server Manager wizard I removed the .Net 3.0 Framework Features, which requires a reboot to complete.

Once the uninstall was completed I re-installed the .Net 3.0 Framework using the same Server Manager wizard. When the installation completed I rebooted the server for good measure, and once the server came back online the Snapdrive service was running again. Whew! What a morning. Now on to expanding the LUNs as the application owner requested.
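Since the whole problem came down to the Net.Tcp Port Sharing dependency, a quick check of that service's state is worth having on hand. Below is a minimal Python sketch that reads the state out of the text printed by sc query NetTcpPortSharing (the service name that shows up in the event log above). The sample text is illustrative, not captured from the affected server, and the function name is my own.

```python
import re
from typing import Optional

# Looks for a line such as "STATE : 1  STOPPED" or "STATE : 4  RUNNING".
STATE_LINE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def service_state(sc_query_output: str) -> Optional[str]:
    """Return the state word from `sc query` output, or None if it is missing."""
    match = STATE_LINE.search(sc_query_output)
    return match.group(1) if match else None


# Illustrative output only.
SAMPLE = """\
SERVICE_NAME: NetTcpPortSharing
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 0  (0x0)
"""

if __name__ == "__main__":
    print(service_state(SAMPLE))
```

If the state comes back as anything other than RUNNING, Snapdrive is unlikely to start, which matches what the event log was telling us.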