Update July 7th, 2005
From the referrer logs I see that this post comes up fairly often in Google searches. I'm updating it with a link to a later post I wrote called 'More on performance vs. load testing'.
Performance testing
The goal of performance testing is not to find bugs, but to eliminate bottlenecks and establish a baseline for future regression testing. To conduct performance testing is to engage in a carefully controlled process of measurement and analysis. Ideally, the software under test is already stable enough so that this process can proceed smoothly.
A clearly defined set of expectations is essential for meaningful performance testing. If you don't know where you want to go in terms of the performance of the system, then it matters little which direction you take (remember Alice and the Cheshire Cat?). For example, for a Web application, you need to know at least two things:
- expected load in terms of concurrent users or HTTP connections
- acceptable response time
- at the application level, developers can use profilers to spot inefficiencies in their code (for example poor search algorithms)
- at the database level, developers and DBAs can use database-specific profilers and query optimizers
- at the operating system level, system engineers can use utilities such as top, vmstat, iostat (on Unix-type systems) and PerfMon (on Windows) to monitor hardware resources such as CPU, memory, swap, disk I/O; specialized kernel monitoring software can also be used
- at the network level, network engineers can use packet sniffers such as tcpdump, network protocol analyzers such as ethereal, and various utilities such as netstat, MRTG, ntop, mii-tool
However, testers also take a black-box approach in running the load tests against the system under test. For a Web application, testers will use tools that simulate concurrent users/HTTP connections and measure response times. Some lightweight open source tools I've used in the past for this purpose are ab, siege, httperf. A more heavyweight tool I haven't used yet is OpenSTA. I also haven't used The Grinder yet, but it is high on my TODO list.
When the results of the load test indicate that performance of the system does not meet its expected goals, it is time for tuning, starting with the application and the database. You want to make sure your code runs as efficiently as possible and your database is optimized on a given OS/hardware configurations. TDD practitioners will find very useful in this context a framework such as Mike Clark's jUnitPerf, which enhances existing unit test code with load test and timed test functionality. Once a particular function or method has been profiled and tuned, developers can then wrap its unit tests in jUnitPerf and ensure that it meets performance requirements of load and timing. Mike Clark calls this "continuous performance testing". I should also mention that I've done an initial port of jUnitPerf to Python -- I called it pyUnitPerf.
If, after tuning the application and the database, the system still doesn't meet its expected goals in terms of performance, a wide array of tuning procedures is available at the all the levels discussed before. Here are some examples of things you can do to enhance the performance of a Web application outside of the application code per se:
- Use Web cache mechanisms, such as the one provided by Squid
- Publish highly-requested Web pages statically, so that they don't hit the database
- Scale the Web server farm horizontally via load balancing
- Scale the database servers horizontally and split them into read/write servers and read-only servers, then load balance the read-only servers
- Scale the Web and database servers vertically, by adding more hardware resources (CPU, RAM, disks)
- Increase the available network bandwidth
In a standard test environment such as a test lab, it will not always be possible to replicate the production server configuration. In such cases, a staging environment is used which is a subset of the production environment. The expected performance of the system needs to be scaled down accordingly.
The cycle "run load test->measure performance->tune system" is repeated until the system under test achieves the expected levels of performance. At this point, testers have a baseline for how the system behaves under normal conditions. This baseline can then be used in regression tests to gauge how well a new version of the software performs.
Another common goal of performance testing is to establish benchmark numbers for the system under test. There are many industry-standard benchmarks such as the ones published by TPC, and many hardware/software vendors will fine-tune their systems in such ways as to obtain a high ranking in the TCP top-tens. It is common knowledge that one needs to be wary of any performance claims that do not include a detailed specification of all the hardware and software configurations that were used in that particular test.
Load testing
We have already seen load testing as part of the process of performance testing and tuning. In that context, it meant constantly increasing the load on the system via automated tools. For a Web application, the load is defined in terms of concurrent users or HTTP connections.
In the testing literature, the term "load testing" is usually defined as the process of exercising the system under test by feeding it the largest tasks it can operate with. Load testing is sometimes called volume testing, or longevity/endurance testing.
Examples of volume testing:
- testing a word processor by editing a very large document
- testing a printer by sending it a very large job
- testing a mail server with thousands of users mailboxes
- a specific case of volume testing is zero-volume testing, where the system is fed empty tasks
- testing a client-server application by running the client in a loop against the server over an extended period of time
- expose bugs that do not surface in cursory testing, such as memory management bugs, memory leaks, buffer overflows, etc.
- ensure that the application meets the performance baseline established during performance testing. This is done by running regression tests against the application at a specified maximum load.
In the context of load testing, I want to emphasize the extreme importance of having large datasets available for testing. In my experience, many important bugs simply do not surface unless you deal with very large entities such thousands of users in repositories such as LDAP/NIS/Active Directory, thousands of mail server mailboxes, multi-gigabyte tables in databases, deep file/directory hierarchies on file systems, etc. Testers obviously need automated tools to generate these large data sets, but fortunately any good scripting language worth its salt will do the job.
Stress testing
Stress testing tries to break the system under test by overwhelming its resources or by taking resources away from it (in which case it is sometimes called negative testing). The main purpose behind this madness is to make sure that the system fails and recovers gracefully -- this quality is known as recoverability.
Where performance testing demands a controlled environment and repeatable measurements, stress testing joyfully induces chaos and unpredictability. To take again the example of a Web application, here are some ways in which stress can be applied to the system:
- double the baseline number for concurrent users/HTTP connections
- randomly shut down and restart ports on the network switches/routers that connect the servers (via SNMP commands for example)
- take the database offline, then restart it
- rebuild a RAID array while the system is running
- run processes that consume resources (CPU, memory, disk, network) on the Web and database servers
Conclusion
I am aware that I only scratched the surface in terms of issues, tools and techniques that deserve to be mentioned in the context of performance, load and stress testing. I personally find the topic of performance testing and tuning particularly rich and interesting, and I intend to post more articles on this subject in the future.