If you do a lot of work with NetApp filers, you've no doubt used NetApp's SnapMirror replication tool. The product is inherently thin-provisioned, so it's common to have replicated volumes running at a high utilization rate. It's great to see that, but high utilization rates are a double-edged sword. With them, sometimes it's hard to avoid running out of space. And that, of course, is bad news.
At one customer site where I worked, we ran into exactly that problem: the aggregate ran out of space (aggregate is NetApp's term for a virtualized pool of physical disks). Fixing the problem was no simple task. I tried a number of approaches before I finally found one that worked.
Here's what worked, what didn't and why:
After realizing the aggregate was out of space, my first step was to perform a volume copy with the "-s" parameter, which bases the copy on an existing snapshot. I chose this approach because, if it worked, no new snapshots would need to be created and I'd be able to copy the replicated volume from one aggregate to another. Unfortunately, it didn't work: it copied the contents of the volume but not all of the snapshot information. That was a problem because NetApp's Fabric-Attached Storage (FAS) systems use snapshots to track where they are in the replication stream.
After volume copy failed, I tried ndmpcopy, and I tried using SnapMirror to replicate to another volume in another aggregate that had available space. Both operations failed; they require a snapshot to begin their operation, and the filer simply didn't have enough disk capacity to create even a tiny snapshot to begin the replication.
The only alternatives left were to grow the aggregate, destroy an existing aggregate that had no volumes in order to reclaim its disks, or delete snapshots in the volumes in the out-of-space aggregate.
The first two alternatives were problematic. The problem with growing the aggregate is that once disks are added to an aggregate, you can't reclaim them without destroying the aggregate. This option wasn't appealing.
I next considered destroying an existing aggregate that had no volumes in it, to reclaim those disks. But when I looked at the empty aggregates, the disks that composed them were much smaller than the disks in the aggregate that was out of space. If you put disks of different sizes in one aggregate, the Data ONTAP OS will "rightsize" the disks, which means it will use the smallest disk as a common-denominator size. So if you have a pool of 300 GB drives and you add a shelf of 500 GB drives to this aggregate, all the 500 GB drives will show up as 300 GB. We couldn't afford to lose all that space, so I discounted that option.
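To put a number on that loss, here is a minimal Python sketch of the arithmetic under the simplified model described above, where every disk in the pool is treated as the size of the smallest disk. The shelf size of 14 disks and the function names are assumptions made for illustration, and the calculation ignores parity disks, spares and file system overhead.

```python
def usable_capacity_gb(disk_sizes_gb):
    """Raw capacity when every disk is counted at the size of the smallest disk."""
    if not disk_sizes_gb:
        return 0
    return min(disk_sizes_gb) * len(disk_sizes_gb)


def capacity_lost_gb(disk_sizes_gb):
    """Raw capacity given up because larger disks are counted at the smaller size."""
    return sum(disk_sizes_gb) - usable_capacity_gb(disk_sizes_gb)


# Hypothetical example: one shelf of 300 GB drives plus one shelf of 500 GB drives.
existing_shelf = [300] * 14
added_shelf = [500] * 14
mixed_pool = existing_shelf + added_shelf

print(usable_capacity_gb(mixed_pool))  # 8400
print(capacity_lost_gb(mixed_pool))    # 2800
```

In this made-up case, each 500 GB drive contributes only 300 GB, so 200 GB per drive (2,800 GB across the shelf) goes unused.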
The only other option I had was to delete snapshots in the aggregate that was out of space. I began investigating snapshot sizes. Since snapshots are cumulative, looking at them from the oldest to the newest gives an accurate picture of how much space can be reclaimed by deleting them. I noticed there were two snapshots that together held about 1 TB of data, so I deleted them. Within 30 minutes, WAFL began reporting free disk space.
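The oldest-to-newest review can be sketched in a few lines of Python. This is only an illustration: the snapshot names and sizes are invented, the function name is hypothetical, and the per-snapshot sizes are treated as if they could simply be added up. In practice, blocks shared between snapshots are freed only when the last snapshot referencing them is deleted, so the real figure can differ.

```python
def oldest_first_reclaim(snapshots, target_gb):
    """Walk snapshots oldest first and pick deletions until target_gb is reached.

    snapshots: list of (name, size_gb) tuples, ordered oldest to newest.
    Returns (names_to_delete, estimated_gb_reclaimed).
    """
    chosen = []
    total = 0
    for name, size_gb in snapshots:
        if total >= target_gb:
            break  # enough space would already be reclaimed
        chosen.append(name)
        total += size_gb
    return chosen, total


# Invented sizes, oldest snapshot first.
snapshots = [
    ("nightly.3", 600),
    ("nightly.2", 450),
    ("nightly.1", 20),
    ("nightly.0", 5),
]

names, reclaimed = oldest_first_reclaim(snapshots, target_gb=1000)
print(names)      # ['nightly.3', 'nightly.2']
print(reclaimed)  # 1050
```

With these invented numbers, the two oldest snapshots account for roughly 1 TB, which mirrors the situation described above.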
The freed space allowed WAFL to create a snapshot so that I could replicate the volume to another aggregate. Once that was done, I swapped the volume names, since the management software tracks the replication relationship by volume name. After I was sure the source filer was replicating to the destination filer without any problem, I deleted the old replicated volume to regain space within the aggregate. Problem solved.
If an aggregate isn't completely out of space but just nearly out of space, you obviously need to take action, but consider yourself lucky. As long as SnapMirror can create a snapshot, you can solve the space problem by creating a volume in a different aggregate, restricting it and replicating the data. In this situation, the source filer (Filer A, Volume A) replicates to the destination filer (Filer B, Volume A), which in turn replicates to the newly created volume on the destination filer (Filer B, Volume B). Once Volume B is completely in sync with Filer B, Volume A, you should be able to point the source filer (Filer A, Volume A) to the newly created volume (Filer B, Volume B) and destroy the old volume (Filer B, Volume A). Once you destroy the volume, it will at first look like you haven't gained any space back in the nearly full aggregate, but WAFL just needs a few minutes to free up the data blocks that were locked down in the deleted volume.
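The cut-over described above can be modeled as a small change to a list of replication relationships. The Python sketch below is only a way to visualize the before and after topology; it does not talk to a filer, and the "filer:volume" labels and the cut_over function are hypothetical names chosen for illustration.

```python
def cut_over(relationships, old_dest, new_dest):
    """Repoint replication from old_dest to new_dest.

    relationships: list of (source, destination) pairs.
    The temporary old_dest -> new_dest hop is dropped, and any relationship
    that targeted old_dest is pointed at new_dest instead.
    """
    result = []
    for src, dst in relationships:
        if (src, dst) == (old_dest, new_dest):
            continue  # the intermediate hop is no longer needed
        if dst == old_dest:
            dst = new_dest
        result.append((src, dst))
    return result


# During the migration: Filer A feeds Filer B Volume A, which feeds Volume B.
before = [
    ("filerA:volA", "filerB:volA"),
    ("filerB:volA", "filerB:volB"),
]

after = cut_over(before, old_dest="filerB:volA", new_dest="filerB:volB")
print(after)  # [('filerA:volA', 'filerB:volB')]
```

Once the list contains only the direct relationship, nothing depends on the old volume any longer, which is the point at which it can be destroyed.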
In the future, NetApp users will have an easier time dealing with this kind of problem. With the company's next-generation GX system, if the aggregate begins to run out of space, the OS will allow you to simply move the replicated volume from one storage pool to another seamlessly. So you will be able to run storage at a much higher level of utilization with little risk as long as you have other aggregates with free space.
About the author
Seiji Shintaku is a principal consultant for RTP Technology. Before joining RTP Technology, he was global NetApp engineer for Lehman Brothers, Celerra and DMX engineer for Credit Suisse First Boston, principal consultant for IBM, and global Windows engineer for Morgan Stanley. He can be reached at firstname.lastname@example.org. RTP Technology is a VAR for NetApp, EMC, F5, VMware and Quantum. The company also provides professional services for storage deployments. It can be reached at (201) 796-2266.