SANs and Excessive Disk Utilization? 83
pnutjam asks: "I work for a small to medium mental health company as the Network Administrator. While I think a SAN is a bit of overkill for our dozen servers, it was here when I got here. We currently boot 7 servers from our SAN, which houses all of their disks. Several of them have started to show excessive disk load, notably our SQL server and our old domain controller (which is also the file/print server). I am in the process of separating our file/print server from our domain controller, but right now I get excessive disk load during the morning when people log on (we use roaming profiles). I think the disks need to be defragged, but should this be done on the servers, or on the SAN itself? When it comes to improving performance, I get conflicting answers when I inquire whether I would get better throughput from newer fibre-channel cards (ours are PCI-X; PCI-E is significantly faster), or mixing in some local disks, or using multiple fibre channel cards. Has anyone dealt with a similar situation, or have expertise in this area?"
Storage Area Network (Score:3, Informative)
Re: (Score:2)
Re:No Acronyms! (Score:4, Informative)
----
The whole SAN part is a red herring. He just has a storage area network (presumably Fibre Channel, as opposed to iSCSI), which is just a means of connecting servers to storage enclosures. The storage protocol is still SCSI; it's just over a different transport layer.
In other words, he has multiple servers connected to a single storage enclosure, and he's seeing capacity and performance issues.
The disks should be considered just like internal disks: defrag from the respective servers.
I would bet that his problem is simply having insufficient disks (spindles) to serve the morning peak workload... just like if you had a few internal disks.
In short:
- Defrag from each server, if you have a fragmentation issue
- Add more disks to spread the workload out
- Consider leaving the boot disks in each server, and just put data on the SAN. One main reason is that swapping to the SAN can be a problem, since it consumes storage enclosure cache (presuming there is any)
Re: (Score:2)
If you don't know what a SAN is, and are too lazy to consult Google, then why post? He's asking for someone who might be able to help, not trying to teach a lesson.
There are two reasons why Slashdot posts 'ask slashdot' questions (that I can think of). One is to get an answer for the original poster (a minor point). The other is so that the other million or so readers have a chance to show off their knowledge and/or learn something new.
The latter is actually a lot more important (for the site as a whole) than the former.
Re: (Score:2)
Re: (Score:2)
Yeah, that'd be good style. But unless you intend to undertake an editorial Jihad on Slashdot and strafe every article that makes slips like this, this is probably not a good place to start. It's a request for specific information in a specialized area. Seems likely that anyone who can contribute anything useful will know what SAN means.
Don't let me discourage you from making u
Whose SAN Box is it? (Score:5, Informative)
Re: (Score:1, Informative)
Disk enclosures, from low end to high end, simply employ differing RAID levels to present logical disks as sequential block extents (ranges). A disk enclosure is not in the business of layout... block 001 is immediately before block 002 (although might be RAID-1 or RAID-5 or RAID-6 on the backend).
---
Also, to the submitter's question: throughput is rarely the issue with SANs; 1Gb or more is more than adequate for most apps. The bottle
Re: (Score:2)
Re: (Score:1)
The problems started after splitting the PDC and the file/print sharing. Where are the profiles stored: on the PDC or on the file/print server? Try moving them to the server that really needs them at login time.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Then I have two words for you: "support ticket."
Open a ticket with their tech support. Tell them about the diagnostic steps you've taken. If possible, get someone to come out and examine the box or do maintenance on it for you.
Re: (Score:2)
Re: (Score:2, Interesting)
Or are you perhaps thinking of NAS (Network attached storage) devices?
Re: (Score:1)
Re: (Score:2)
But yeah, with a SAN you're talking about something that provides a block level interface to storage. Fragmentation is a filesystem level issue.
Re: (Score:2, Informative)
Re: (Score:2)
That seems like an odd configuration (Score:4, Insightful)
Re: (Score:2)
Re: (Score:1)
Then again, it truly sounds like he probably needs to review his SAN architecture. I'd probably have the DB on its own set of spindles, and have two domain controllers, with the primary being standalone and the secondary (and potentially a tertiary if needed) doubling as print servers. Other than that, we'd need a lot more i
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
What kind of SAN? (Score:2)
Re:What kind of SAN? (Score:4, Informative)
Re: (Score:2, Informative)
1. How many target ports on the Magnitude 3D? For that model I'm not sure, but they are probably 2Gbit each. Try to balance load across the ports via multipathing software or manual balancing (server A uses port 1, server B uses port 2, etc.).
2. What is your SAN switch topology? If hopping across ISLs, make sure that you have an adequate amount of trunked bandwidth between the switches.
3. What speed are your SAN switches? Using 1Gbit switches would bottleneck a lot fas
Re: (Score:2)
One thing to keep in mind (Score:2)
Seconded on the suggestion to call Xiotech. They know their stuff and should be able to help you out.
It's kind of funny - I'm at Novell BrainShare, and my fourth session of the day was how to diagnose poor server performance due to SAN congestion. In NetWare we have always had tools to measure h
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Glad to have been of help :-) (Score:2)
Re: (Score:1, Offtopic)
Not enough information (Score:5, Insightful)
Also, while "have you tried defragging?" is a common home troubleshooting tip, it's not clear how you came up with the idea that the SAN has to be defragged. If you have reasons and you're just simplifying to keep the post short, great. Defrag away according to the SAN manufacturer's recommendations. However, don't become obsessed with it unless you know that fragmentation's an issue.
You need to spend some time benchmarking the whole system. Figure out how much disk, processor, network IO, and SAN IO are being used. Know what percentage of the total that is. Figure out exactly which servers are causing performance problems at which times.
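A minimal sketch of that kind of measurement loop in Python; the metric source here is a stub (a canned list of readings) standing in for whatever counter you actually poll (perfmon, SNMP, etc.):

```python
# Illustrative sketch: poll any metric-returning callable on a fixed
# interval and report the busiest sample, so peaks (e.g. the morning
# logon storm) can be pinned to a time of day. The metric source is
# a stub; swap in a real perfmon/SNMP reader.
import time
from datetime import datetime

def sample_metric(read_metric, samples=5, interval_s=0.01):
    history = []
    for _ in range(samples):
        history.append((datetime.now(), read_metric()))
        time.sleep(interval_s)
    return history

def busiest(history):
    # the (timestamp, value) pair with the highest value
    return max(history, key=lambda ts_v: ts_v[1])

fake_load = iter([12, 35, 80, 95, 40])    # pretend "% disk load" readings
history = sample_metric(lambda: next(fake_load))
when, load = busiest(history)
print(f"peak load {load}% at {when:%H:%M:%S}")
```

Graph the history over a few days and the "which servers at which times" question usually answers itself.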
"Find the problem" is always the first step in "fix the problem."
Once you know what's going on, you can deal with the problem intelligently. Are all the servers booting at the same time? Give them different spindles to work from, or stagger the boot times. Are all of the users logging in at once? Figure out why that's slow (network speed, SAN, data size, etc.) and split the data across multiple servers and SANs, or improve the hardware.
If you can make the case with hard data that the SAN is swamped, you can probably pry money from management to fix the problem. However, guessing that it -might- be something won't get you very far. They don't want to spend $20k on a fix to be told, "Nope. It was something else."
Re:Not enough information (Score:4, Interesting)
I finally found an SNMP query for "disk load". This purports to be a percentage, but I've seen it showing way over 100, sometimes as high as four or five hundred. If it gets above 50 or 60, people start to complain. My disk load spikes in the morning when people are logging in; it generally goes to about 80% or higher on my graphs [google.com]. My SQL server doesn't have these problems, and I have yet to find a suitable way of monitoring the SQL log, where I think the problem is originating.
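For what it's worth, a busy-time counter can legitimately read over 100% when the "disk" is really several spindles working in parallel. A toy calculation (counter names and figures are invented) of how such a percentage is typically derived from two snapshots of a cumulative busy-time counter:

```python
# Hypothetical sketch: derive a utilization percentage from two
# snapshots of a cumulative busy-time counter, the way perfmon's
# "% Disk Time" works. A LUN backed by several spindles can service
# requests in parallel, so the result can exceed 100%.
def utilization_pct(busy_ms_t0, busy_ms_t1, interval_ms):
    return 100.0 * (busy_ms_t1 - busy_ms_t0) / interval_ms

# Busy 4,500 ms during a 5,000 ms sample window:
print(utilization_pct(10_000, 14_500, 5_000))   # 90.0

# Five spindles each busy 4,000 ms in the same window:
print(utilization_pct(0, 5 * 4_000, 5_000))     # 400.0
```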
Re: (Score:1)
There's lots of tuning that can take place on the server side before you start re-striping. That being said, more spindles will likely help on the storage side.
A couple of
Re: (Score:2)
Monitor and analyze a few common metrics on your servers. Physical Disk IO Bytes/sec can help you determine whether the FC HBAs are a bottleneck; a 2Gb/s HBA is good for (at most) 200MB/s in either direction; are
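The 200MB/s figure follows from the encoding: 1/2/4 Gb Fibre Channel uses 8b/10b, so every data byte costs ten bits on the wire. A quick back-of-envelope check (simplified; it ignores frame and protocol overhead):

```python
# Rough usable bandwidth of an N-gigabit Fibre Channel link.
# 8b/10b encoding spends 10 line bits per data byte; frame and
# protocol overhead would shave off a little more in practice.
def fc_usable_mb_per_s(line_rate_gbit):
    bits_per_s = line_rate_gbit * 1_000_000_000
    return bits_per_s / 10 / 1_000_000   # 10 line bits per data byte

print(fc_usable_mb_per_s(2))   # 200.0 MB/s per direction
print(fc_usable_mb_per_s(4))   # 400.0
```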
stop guessing (Score:2)
I can definitely say that without vendor make, model, and software version information, you're not likely to get much helpful information in this venue, and you properly ought to be going to the vendor for technical support.
Performance Troubleshooting (Score:1)
Defrag explained (Score:1)
What is "defrag"?
As a file grows, pieces of it may be strewn across the disk, causing the head to seek back and forth across the disk while reading it. This happens faster on some file systems than on others, and it happens faster on disks that are more than half full. Defragmentation [wikipedia.org] assembles the pieces of each file into one piece for faster access. Some defrag programs can also put related files next to one another.
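A toy illustration of the effect, counting the head seeks needed to read a file (block numbers are made up):

```python
# Toy model: a contiguous file needs one seek; a fragmented one
# needs an extra seek for every non-adjacent jump between blocks.
def seeks_needed(blocks):
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)

fragmented = [10, 11, 503, 504, 90, 91, 700]   # pieces strewn across disk
defragged = list(range(10, 17))                # same 7 blocks, contiguous

print(seeks_needed(fragmented))   # 4
print(seeks_needed(defragged))    # 1
```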
Re: (Score:2)
It's when you kill a team-mate and lose a point.
hmm.. (Score:5, Informative)
Re: (Score:2)
Hmm, are you a friend of Essjay [wikipedia.org]?
Wrong side of the problem (Score:3, Interesting)
You state that the disk load is high in the morning when everyone logs in with roaming profiles, which suggests to me that the roaming profiles are way too large.
Depending on the Windows versions used, move the contents of the "My Documents" folder to their personal network shares (give them one if they don't have any), tell them to move data in their Desktop folder to that share and only keep shortcuts, and maybe even enforce a quota limit on the clients.
Check your favorite search site on "Windows reduce roaming profile size" for more tips.
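If it helps, a small script along these lines will surface the oversized profiles first; the share path in the commented usage is made up, so point it at your real roaming-profile share:

```python
# Hypothetical helper: total up each user's profile directory so the
# oversized ones can be found before anything else is tuned.
import os

def dir_size_bytes(path):
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def profile_sizes(profiles_root):
    # map each top-level profile directory to its total size, largest first
    sizes = {}
    for entry in os.scandir(profiles_root):
        if entry.is_dir():
            sizes[entry.name] = dir_size_bytes(entry.path)
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)

# Example usage (path is illustrative):
# for user, size in profile_sizes(r"\\fileserver\profiles"):
#     print(f"{user}: {size / 2**20:.1f} MB")
```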
Re: (Score:2)
Re: (Score:2, Informative)
Re: (Score:2)
Sounds familiar (Score:2)
We boot some hosts from our SAN (McData SAN switches, IBM SVC, multiple DS4800s). First, is it your SAN that's bottlenecking (the switches?), the storage controller, or the hosts? When they bottleneck, are you seeing a lot of paging? Large roaming profiles loading all at once could be causing you to page; since your swap is out on your storage controller, you're doing double duty and paying a penalty for it. As o
Re: (Score:2)
Our page file is on a dedicated partition on the SAN also. I do notice that it is usually 80% utilized and at night when our backups run it goes close to 100%. Our diskload also spikes at that time, but not as high as it does in the morning. When I get the high diskload spi
Re: (Score:2)
Defragging at the OS level might help you, and there's no harm in it. The more sequential (at both a logical placem
Re: (Score:1)
Disks are a bit more complicated than processors or memory in terms of measuring how much of their 'performance' is in use.
Factors to look at include... well, most of the ones you'll see under 'PhysicalDisk' in Windows perfmon.
I/Os per sec and bytes transferred per sec are of interest, but the one that's _really_ an indicator in terms of performance is disk queue length. A long queue means that for 'what
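The parent's point can be put in numbers with Little's law: average queue length equals arrival rate times average service time. A sketch (the IOPS and service-time figures are invented):

```python
# Little's law: average outstanding requests = arrival rate x service time.
# Sustained queue lengths much above ~2 per spindle mean requests are
# waiting on the disk rather than being serviced promptly.
def avg_queue_length(iops, avg_service_time_s):
    return iops * avg_service_time_s

print(avg_queue_length(400, 0.020))   # 8.0 outstanding requests
print(avg_queue_length(400, 0.005))   # 2.0 once service time improves
```

The same arithmetic shows why adding spindles helps: spreading the 400 IOPS across more disks cuts the per-disk arrival rate, and the queue shrinks proportionally.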
SANs (Score:4, Informative)
First question: what are the symptoms of the problem? How do you know you're 'pegging your disks'? If you're seeing IO load to your HBAs being really high, then yes, you might find that you need to upgrade them. From experience, though, HBAs are rarely your limiting factor.
Much more likely is that you're experiencing local disk fragmentation, as you correctly point out. I can't offer specific advice for your array, but in my experience, SANs are 'blind' to filesystems. They work on disks and LUNs. LUNs are the devices a host sees, and these can safely and easily be defragmented, in all the normal ways you would use.
Are you accessing your SAN over Fibre Channel or iSCSI? If it's fibre, then again, you _may_ have fabric contention, but it's unusual in my experience (especially on a 17-server SAN). If it's iSCSI, then you have network contention to worry about. Is it possible that your 'gimme profile' requests across your network are also contending with your iSCSI traffic?
You may find that your SAN has 'performance tools' built in. That's worth a look, to see how busy your spindles are. Because of the nature of a SAN, you may find that LUNs are being shared on the same physical disks. This can be a real problem if you've done something scary like using Windows dynamic disks to grow your filesystem. Imagine having two LUNs striped, when in actuality, on the back end, they're on two different 'bits' of a RAID 5 set. This is bad, and is worth having a look at.
One place where SANs do sometimes have issues is page files, which is possibly a problem if you're SAN booting. SANs have latency, and Windows doesn't like high latency on page files. If you really push it, it'll start bluescreening.
This is fixed by using local disks for the OS, or just moving the swap file to a local disk.
HBA expansion _might_ improve performance, assuming this is your bottleneck. However you'll need to ensure you are multipathing your HBAs. (Think of them like network cards, and you won't go far wrong - you need to 'cheat' a bit in order to share network bandwidth on multiple cards). But like I say, you probably want to check this is actually a problem. If they're not very old, then it's unlikely, although it might be worth checking which internal bus the HBAs are on. (Resilience and contention).
It's possible your SAN is fragmented, but it's unlikely this is your problem. SANs don't have the same problem with adding and deleting files (LUNs), so all your backend storage will be in contiguous lumps anyway.
And I apologise if I use terminology that you're not familiar with. Each SAN vendor seems to have their own nomenclature when it comes to the 'bits', but they all work in roughly the same way. You have disks, which are... well, disks. RAID groups, which are disks bundled together with RAID 1, RAID 1+0, RAID 5 (with varying parity ratios), and very occasionally RAID 0. You have LUNs: Logical Units. These are... well, chunks of your bundles of disks. The first 100MB of a 5-disk RAID 5 group might be a LUN. The LUN is what the host 'sees', as a single atomic volume. Most disk groups can have multiple LUNs on them, which is why you do need to watch out for how volume management is operating. I have seen a case where a Windows 2000 server added a second LUN and used dynamic disks to stripe, not realising that on the back end both those LUNs were on the same RAID 5 (4+1) group. That caused the disks to seek back and forth continually, and really hurt performance.
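The seek-thrash in that anecdote is easy to model. A toy simulation (block numbers invented) of reading a volume striped across two LUNs that secretly share spindles, versus reading each LUN's region sequentially:

```python
# Toy model of the dynamic-disk mistake: striping across two LUNs that
# actually live on the same spindles makes the head ping-pong between
# two distant regions of the platter on every stripe.
def head_travel(positions):
    travel, pos = 0, positions[0]
    for p in positions[1:]:
        travel += abs(p - pos)
        pos = p
    return travel

lun_a = range(0, 1000, 100)       # blocks 0..900 on the RAID group
lun_b = range(5000, 6000, 100)    # blocks 5000..5900, same spindles

# A striped volume alternates reads between the two LUNs:
striped = [b for pair in zip(lun_a, lun_b) for b in pair]
# Reading each LUN's region sequentially instead:
sequential = list(lun_a) + list(lun_b)

print(head_travel(striped), head_travel(sequential))   # 94100 5900
```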
Oh, and this is also probably a good excuse to be booking SAN training. IMO SANs are fun and interesting, not to mention in demand and well paid :)
Re: (Score:1)
Getting someone in to fix it _may_ end up being the right choice, but it does help to check first, where the problem lies - there's no point in getting a 'SAN expert' in if your problem is merely filesystem fragmentation.
SANs don't need to be defragged... (Score:2)
SANs generally work by either:
1. Presenting physical spindles to the server as raw disks -or-
2. Presenting a RAID volume to the server, which consists of a section of many disks.
All SAN vendors that I'm aware of allocate LUNs as contiguous areas of disk. It's faster this way because heads don't have to seek very far to find data within the same LUN. Ev
Re: (Score:2)
I'd tend to agree; it's not usually a problem. But then, if the storage controller has been in place for a long time, with multiple admins, hosts added & deleted, etc., the (mis)management of it over the years could have led to lots of litt
Re: (Score:2)
Re: (Score:1)
a) too little information
b) no good reason to associate the problem with the SAN and
c) a noted problem with the profiles you are using.
SAN & "Disk" load (Score:1)
You need to know the actual layout on the SAN's physical disks, that is, how many spindles are available for each of your servers and which servers use the same set of spindles.
The most likely cause of bad performance is that the same spindles are overloaded while some others do nothing, as it is very rare to have the link elements (fibre and cards) overloaded. As another poster noted, you need to know the load on your disks to decide whether the link may be the cause; for example, are you doing more than 1 Gb/s IO
SQL is a Memory Hog (Score:2)
An SQL server doing something that is too big for it can get you into "slower than my last 486" territory pretty quickly.
I know very little about SANs, but assume the file system is pretty fast... so maybe it's not the problem at all.
SANS and SQL (Score:2)
Where I would take a look is at your RDBMS. If you're getting 80% disk utilization at the SAN you may be doing far more sequential/full-table scans than you need to be. Turn explain plan on and start looking for opportunities to add indexes.
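A sketch of that workflow using Python's stdlib sqlite3 (the submitter's SQL Server has its own showplan equivalent): the same query's plan changes from a full-table scan to an index search once an index exists. Table and column names are invented:

```python
# Demonstrates reading the query plan before and after adding an index.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clients (id INTEGER, last_name TEXT)")
query = "SELECT * FROM clients WHERE last_name = ?"

# Without an index: the plan's detail column reports a full-table scan.
print(db.execute("EXPLAIN QUERY PLAN " + query, ("Smith",)).fetchall())

db.execute("CREATE INDEX idx_last ON clients(last_name)")

# With the index: the plan now searches using idx_last instead of scanning.
print(db.execute("EXPLAIN QUERY PLAN " + query, ("Smith",)).fetchall())
```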
Finally, check virtual memory on the c
Re: (Score:2)
I'll try to post here if that helps or not.
Re: (Score:2)
A few answers (Score:4, Informative)
2) It is far more likely your OS needs defragging than your disk array. Your disk array CAN become fragmented if you add and delete LUNs often, though.
3) Yes, you need multiple fibre cards, but for redundancy, not for bandwidth.
4) Try and put your major workloads on their own RAID arrays on your disk controller.
5) Check to see if you have enough memory in those boxes. If you have one server that keeps swapping out to disk and you are booting from SAN, you are going to get very hosed, very quickly. If these boxes have any internal disk at all, put the swap there.
6) If it is possible with your arrays, max out the segment size. (Engenio/LSI - based arrays can do this.)
This should be enough to get you started.
SirWired
RAM RAM RAM (Score:2)
Doesn't sound like a fragmentation problem... (Score:2)
Re: (Score:1)
Duh - Call the manufacturer (Score:2)
Sheesh...