One of my new projects is to get my head round VMware vCenter Operations Manager – or vCOPS as it’s more commonly known. I’d freely admit that performance is one of my weak areas. I’m pretty good at troubleshooting and resolving any number of configuration problems, but resolving performance problems isn’t one of my strengths. Why?
Well, I’ve spent a great many years living in the abstract world of the lab, whether that be as a trainer, author or now at VMware. In that time I didn’t get a whole lot of exposure to genuine performance issues – after all, a lab environment never experiences the same non-linear performance issues that you see in the real world. What did happen a lot of the time was that students would bring me their performance problems, and I would try to diagnose them from first principles. By first principles, I mean things such as:
- Over-use of SMP vCPU
- Disk intensive VMs placed on the same LUN/Volume/Spindles
- The wrong RAID level used
- Insufficient RAM allocated
- Misuse of various features in vSphere, such as poor resource pool design, inappropriate shares and so on
Despite my lack of exposure, some of my favourite parts of the Install & Configure/Fast-Track courses were things like Limits/Reservations/Resource Pools/DRS/ESXTOP and so on – mainly because these are really juicy topics that can be tricky to explain. I enjoyed the challenge.
So now that I’m looking at vCOPS, I’m looking at this subject all over again, whilst bearing in mind that vCOPS isn’t merely a performance monitoring solution. It actively – or rather pro-actively – goes looking for “health problems”. There are two analogies for this. See yourself as Dr VI Admin MD at the vSphere Hospital for the Virtual Machine, if you like. What vCOPS gives you are the pre-emptive, pro-active diagnostic tools to analyse your patients (the VMs) and deal with their minor symptoms before they are really unwell. Another analogy I like is the “dashboard” in your car. Not only does it tell you speeds and feeds – there’s also a little red light that gets illuminated when you’re running low on oil. It’s better to receive pro-active, pre-emptive alerts than to be at the side of the road with a seized engine.
So one thing I’ve been looking at is different tools for generating a fake workload inside a VM – for the four core resources of CPU, Memory, Disk and Network. I thought if I put together a compendium of tools in a single web-page, it might help someone looking to do the same thing.
A couple of really strong observations have come out of my early use of vCOPS. Firstly, vCenter and other 3rd party performance monitoring tools tend to just contain “thresholds” that offer a simple “traffic light” view of performance. You know the same tedious green, yellow, red system where alarms are triggered at 75%, 90% and so on. That’s all well and good. The trouble is that these 3rd party tools in the main aren’t really showing deviations. So if an application grows in resource usage (say CPU) over a 6hr period from 10%, 20%, 30%, 40%, 50% to 70% – most of them won’t tell you anything until the VM has smashed through one of their pre-configured “thresholds” such as 75%. By then, I think you could argue, the problem has got out of hand. Wouldn’t it be far better to alert the administrator to the underlying, under-the-surface iceberg of a problem – rather than waiting for the tip of the iceberg to appear just above the water? I see this very much like a Doctor looking for diagnostic information about the health of a patient. When the Doctor monitors the patient, they look for tools that can show that a change is taking place – something other than normal.

The other thing I’ve liked about vCOPS is how it has identified a number of problems in the build of my vSphere homelab. Sometimes those problems have been of my own making; other times vCOPS has identified problems in vSphere that have led to some known issue in a KB. That for me shows two benefits. Firstly, vCOPS isn’t just about performance, it’s about configuration – or should I say “Operations” (the clue is in the name of the product after all!). Secondly, I can absolutely trust what it’s telling me – it doesn’t try to pretend that everything is right in the world when it isn’t.
Anyway – less of my ramblings – on to the tools overview… which I’ve categorised as CPU/Memory, Disk/Network IO, and Application Tools.
CPU and/or Memory Tools
There are a couple of tools that can be used to generate CPU activity. Most do some sort of computational task which causes the CPU to be totally maxed out. Not all are SMP-aware, so I often think it’s easier to give the VM a single vCPU because of this. It’s not particularly realistic, but they do the job. Some of these tools ONLY generate CPU activity – some do both CPU and memory activity.
CPUbusy (Windows VBS or Linux Perl)
This has been around since I was a VMware instructor, when there was a lab where students would peg two VMs to the SAME physical CPU using CPU affinity (note: DRS needs to be turned off, or DRS disabled for the test VMs). By generating a CPU-intensive task, this would show a couple of things – the VM in the main reports 100% utilization when in fact it gets a 50% share of the CPU because of the other VM it is in contention with. The other thing you can use the script for is adjusting the CPU resource management controls (Shares, Limits, Reservations) to show how it’s the VMkernel that controls resources. You used to have to run the script with cscript.exe //nologo to stop endless VBS pop-ups, but I found in Windows 2012 R2 simply right-clicking and using “Run in a command prompt” runs the script in the cmd prompt.
' cpubusy.vbs – loops forever doing floating-point work to keep one vCPU busy
Dim goal
Dim before
Dim x
Dim y
Dim i

goal = 2181818

Do While True
    before = Timer
    For i = 0 To goal
        x = 0.000001
        y = Sin(x)
        y = y + 0.00001
    Next
    y = y + 0.01
    WScript.Echo "I did three million sines in " & Int(Timer - before + 0.5) & " seconds!"
Loop
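For reference, running it from a command prompt so that the output goes to the console (rather than one dialog box per loop) looks like this:

:: run cpubusy.vbs under the console-based script host, suppressing the logo banner
cscript //nologo cpubusy.vbs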
The Perl version looks like this:
#!/usr/bin/perl
# cpubusy.pl – loops forever doing floating-point work to keep one vCPU busy

# $^O holds the OS name; the original script uses a smaller loop count on Windows
if ($^O =~ /Win/) {
    $goal = 2700000;
} else {
    $goal = 3000000;
}

while (1) {
    $before = time();
    for ($i = 0; $i < $goal; $i++) {
        $x = 0.000001;
        $y = sin($x);
        $y = $y + 0.00001;
    }
    $y += 0.01;
    print "I did three million sines in ", time() - $before, " seconds!\n";
}
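The Perl version just runs from the shell on Linux. A quick sketch, assuming the script has been saved as cpubusy.pl – start one copy per vCPU if you want to load up an SMP VM:

# make the script executable, then start two copies in the background
# (one per vCPU on a two-vCPU VM)
chmod +x cpubusy.pl
./cpubusy.pl &
./cpubusy.pl &

# when you're finished, kill any remaining copies by name
pkill -f cpubusy.pl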
Prime95/mprime (Windows or Linux)
This is a utility which checks through a big long list of numbers to see whether each number is a prime number or not. There is a Linux version, called mprime (Prime95 is the Windows version).
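On Linux the stress test can be kicked off straight from the command line. A minimal sketch, assuming you’ve downloaded and extracted mprime into the current directory:

# run the mprime "torture test", which starts one worker per CPU
# and keeps them all pegged until you stop it with Ctrl+C
./mprime -t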
StressLinux (Stress and CPULimit)
You can get stress (which seems an odd turn of phrase) as a downloadable .tar.bz2, which you extract and upload to your VMware ESXi host. It’s primarily designed for over-clockers who want to over-clock their CPUs and then test them for reliability. You can also get stress as a bin/executable which can be installed into most Linux distributions…
CPULimit isn’t a stressing tool but a CPU-cycle limiting tool. The idea is that you can use CPULimit to limit the cycles available to stress, so you have a process that really wants to hog the processor but is limited to just 10% of the resource. I’ve seen scripts that attempt to do this, such as CPU_load on GitHub: https://github.com/ajurge/CPU_load

However, I have yet to see these scripts work properly (that might say as much about my low-level Linux skills as anything), and they also appear to require lots of binaries to be installed… 🙁
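For what it’s worth, the basic combination those scripts automate can be done by hand. A rough sketch, assuming stress and cpulimit are both installed (I believe both are packaged in the EPEL repository for CentOS), and with the 10% figure purely as an example:

# start a single CPU-hogging worker in the background for ten minutes
stress --cpu 1 --timeout 600 &

# give stress a moment to spawn its worker process
sleep 2

# cap the newest stress process at roughly 10% of one core – cpulimit
# throttles it by repeatedly pausing and resuming the process
cpulimit -l 10 -p "$(pgrep -n -x stress)"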
Load Storm (Windows – Requires .NET 3.5)
Note: My personal favourite at the moment!
What I like about this is the ability to gradually increase the CPU load until it starts to deviate from vCOPS’ learned “norms”, thus triggering badge changes. The trouble with scripts that eat up all the CPU is that they are sometimes limited in helping you watch how performance monitoring systems react to incrementally increasing demands for CPU resources…
This is a nice, simple and easy to use utility that does allow you to place a % load on the vCPU. Of course it’s extremely difficult to tell a process to use precisely 25% of the CPU, as other processes fight for access, but this neat little utility makes a good job of it. It can also handle memory – although this just allows you to allocate a chunk of memory, rather than simulate an application with steadily increasing memory demands – commonly referred to as a memory leak.
It’s fair to say trying to apply a specific % CPU load isn’t easy, but Load Storm does come close to expectations. So here I have one thread allocated 70% of CPU; actual utilisation as shown by Task Manager is around 60%. If this was a video you would see it fluctuate between 55-60% usage. Having played around with this utility for a day or so, I think you’re better off spawning multiple threads and then counting up the totals they generate – it seems to give more realistic results.
Of course there’s always a danger in accepting these values from guest operating systems. They, after all, generally have no clue that they are running under a hypervisor – so what they report is the percentage of their allocation that they are receiving. So if a VM is allocated 100MHz and uses all of it, Task Manager and co. will generally report 100% utilization, even though that VM may only be using 25% of the core/CPU. Generally I found that Performance Monitor did a better job of reporting the CPU usage, especially since VMware has for some time supplemented it with perfmon .dlls that reflect CPU usage as understood in the virtual world (these appear as the VM Processor and VM Memory objects in Perfmon). As there wasn’t much else running on the VMware ESXi host running my “CorpHQweb01” VM, the % of CPU used in the guest OS and at the VMware ESXi level were much the same – an average of 55%, somewhat less than what Load Storm was configured for (2 threads using 40% of the CPU).
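If you prefer the command line to the Perfmon GUI, the same VMware-supplied counters can be sampled with typeperf. A sketch – the exact counter path may vary with the version of VMware Tools, so list what’s registered in your guest first:

:: list the counters VMware Tools has registered in the guest
typeperf -q | findstr /C:"VM Processor"

:: then sample the hypervisor-aware CPU counter every 5 seconds
typeperf "\VM Processor(*)\% Processor Time" -si 5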
vCOPS was able to show a decline in health (due to the increased CPU activity) well before any simplistic “threshold” parameter would have fired (the kind typically found in vCenter, and in other 3rd party products that poll vCenter for data). So in the charts below the increased CPU activity is seen as a deviation outside what was the “normal” range (I had built up a history of performance data indicating that about 10% CPU activity would be the expected range).
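As an aside, you can approximate the same gradual ramp without a GUI tool by re-using stress from earlier. A minimal sketch – on a 4-vCPU VM each extra worker adds roughly another 25% of overall load, and the 30-minute steps are simply an example interval long enough for vCOPS to register the trend:

#!/bin/bash
# step the CPU load up gradually: 1 worker, then 2, 3 and 4,
# holding each level for 30 minutes before adding the next
for WORKERS in 1 2 3 4; do
    stress --cpu "$WORKERS" --timeout 1800
done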
IO Tools:
IOrate (Linux Only)
This tool was initially developed by EMC staff whilst working at FedEx. They had an internal utility that did a very similar job, but it couldn’t be shared externally, so the guys working for EMC decided to develop their own tool which could be shared. The utility comes with a number of default “profiles” that generate disk IO for different scenarios and use cases. IOrate uses three configuration files – a devices file, a patterns file and a tests file. The devices file controls which disks IOrate will have access to. The patterns file describes the different profile types used – lots of small writes, lots of small reads, big writes, big reads, lots of reads with few writes – together with the different block sizes to use and so on. The tests file describes how these patterns are called and used.
In my case I had to use fdisk -l in my Linux VM (running CentOS) to find out what partitions/disks were mounted, and then edit the devices.ior file to specify it was OK to use that disk.
# UNIX raw device example
Device = "/dev/mapper/VolGroup-lv_home" capacity 100GB;
It was only then that I was able to run IOrate with the following command:
./iorate -p patterns.ior -f devices.ior -t tests.ior
This particular test isn’t particularly aggressive, and there are some other tests for different disk activity, including test-asts.ior, tests-fx.ior and tests-var.ior. Of course you can create your own tests by copying these samples and creating ones of your own. If you want these samples to run for longer, a simple loop script will cause the test to repeat over and over again. This can be created with nano or vi, saved as a bash .sh script and executed with something like sh repeat.sh
#!/bin/bash
# repeat.sh – run the same IOrate test ten times in a row
COUNTER=0
while [ $COUNTER -lt 10 ]; do
    ./iorate -p patterns.ior -f devices.ior -t tests.ior
    let COUNTER=COUNTER+1
done
Finally, one thing I’d say is that despite its name, IOrate can be as CPU intensive as it is disk intensive, so it can generate more contention on the CPU than on the disk.
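If that becomes a problem, one option is to re-use cpulimit from earlier to keep the test disk-bound rather than CPU-bound. A sketch, with the 25% cap purely as an example – as I understand it, the classic cpulimit will sit and wait for a process with the given name to appear:

# in one shell: watch for a process called iorate and cap it at ~25% of a core
cpulimit -l 25 -e iorate

# in another shell: run the test as before
./iorate -p patterns.ior -f devices.ior -t tests.ior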
IOmeter (Windows & Linux)
IOmeter is probably the most well-known of the stress-testing tools. I’ve seen it used in training courses for more than a decade, and it is often used in demos/presentations when a resource demand is required. I set up IOmeter with a default “Access Specification” and it promptly filled my D: partition, which is just 40GB in size. This is because the default setting of “maximum disk size” is 0, which means fill the whole of the disk – good job I selected D: and not C: then!
Fortunately, vCOPS did an excellent job of alerting me to this fact!
Application Tools:
One of the things I’m particularly interested in is stressing tools that generate activity at an application level, as I consider this more realistic, and it affords the opportunity to model expected activity against an application based on potentially real-world information. Databases have always been good candidates for these tools, as it’s a relatively simple process for the people who develop these tools to generate CPU activity from queries and indexing threads, and disk activity through the process of writing bulk updates to database tables and refreshing indexes again. My concern is always with “realism” in my fake-lab-world.
Microsoft SQL Server SQLIOSim Simulator Tool (Windows Only)
This utility has been around for some years in various guises and under various names. You’re meant to be able to find it within the BINN folder of your Microsoft SQL installation; however, I couldn’t locate it in my Microsoft SQL 2008 installation. Instead I ended up downloading it from Microsoft’s website. The 64-bit version for Windows is located at: http://support.microsoft.com/kb/231619
I would recommend downloading and extracting it to a separate partition from your C: drive, as it appears the default test creates and fills temporary DB files, and the SQLIOSim utility seems to use some locally extracted files for this purpose. The default test does not run for a very long period of time, but it’s enough to generate activity to trigger vCOPS health data changes.
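As well as the GUI (sqliosim.exe), the download includes a command-line version (sqliosim.com). The switches below are how I understand them from the KB article, so treat this as a sketch and check KB 231619 for the authoritative list:

:: point the test files at a separate partition and run the main
:: pass for 600 seconds rather than the default duration
sqliosim.com -dir D:\sqliosim-test -d 600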
What I liked about this utility was the ability to re-run the same stress test and identify that disk I/O was the constraining resource. That made me set about moving the VM via Storage vMotion to different classes of storage, and measuring the length of time it took to complete each test:
Clearly my SSD-backed storage (Synology) out-performs my SATA-backed storage (Iomega), although there’s a possibility that once the test is running on SSD, some other resource becomes the constraining one, such as CPU or memory. The interesting thing is that the Synology box supports VAAI with iSCSI, but not with NFS (I’ve yet to upgrade my firmware, which apparently adds VAAI for NFS to the Synology) – and yet the NFS out-performed it. I suspect the disk activity generated by the test isn’t the sort that’s helped by VAAI – because other disk activity generated via vSphere, such as deploying a template, gives very different clone times in vCenter: 7 mins with NFS, and 23 seconds (yes, 23 secs!) to deploy a new Windows 2012 R2 VM.
HammerDB (Windows/Linux)
I found HammerDB a little trickier to set up. You have to tell it how to connect to your various databases, and many people recommend creating a DB upfront rather than using its default methods – as otherwise you don’t get an optimised database to begin with. HammerDB supports both Windows and Linux, and a range of very different databases – Oracle, SQL Server, TimesTen, PostgreSQL, Greenplum, Postgres Plus Advanced Server, MySQL and Redis.
Microsoft Exchange Server Jetstress 2013 Tool
Sadly, I haven’t had the time to play with this utility. I struggled to get Exchange installed in my lab this week, and in the end I just gave up. But one day, when I have renewed patience, I will check this out. It does look quite handy. In the meantime I would check out msexchange.org’s article all about it.