Thursday, November 25, 2010

Performance vs. load vs. stress testing

Here's a good interview question for a tester: how do you define performance/load/stress testing? Many times people use these terms interchangeably, but they have in fact quite different meanings. This post is a quick review of these concepts, based on my own experience, but also using definitions from testing literature -- in particular: "Testing computer software" by Kaner et al, "Software testing techniques" by Loveland et al, and "Testing applications on the Web" by Nguyen et al.

Update July 7th, 2005

From the referrer logs I see that this post comes up fairly often in Google searches. I'm updating it with a link to a later post I wrote called 'More on performance vs. load testing'.

Performance testing

The goal of performance testing is not to find bugs, but to eliminate bottlenecks and establish a baseline for future regression testing. To conduct performance testing is to engage in a carefully controlled process of measurement and analysis. Ideally, the software under test is already stable enough so that this process can proceed smoothly.

A clearly defined set of expectations is essential for meaningful performance testing. If you don't know where you want to go in terms of the performance of the system, then it matters little which direction you take (remember Alice and the Cheshire Cat?). For example, for a Web application, you need to know at least two things:
  • expected load in terms of concurrent users or HTTP connections
  • acceptable response time
Once you know where you want to be, you can start on your way there by constantly increasing the load on the system while looking for bottlenecks. To take again the example of a Web application, these bottlenecks can exist at multiple levels, and to pinpoint them you can use a variety of tools:
  • at the application level, developers can use profilers to spot inefficiencies in their code (for example poor search algorithms)
  • at the database level, developers and DBAs can use database-specific profilers and query optimizers
  • at the operating system level, system engineers can use utilities such as top, vmstat, iostat (on Unix-type systems) and PerfMon (on Windows) to monitor hardware resources such as CPU, memory, swap, disk I/O; specialized kernel monitoring software can also be used
  • at the network level, network engineers can use packet sniffers such as tcpdump, network protocol analyzers such as ethereal, and various utilities such as netstat, MRTG, ntop, mii-tool
From a testing point of view, the activities described above all take a white-box approach, where the system is inspected and monitored "from the inside out" and from a variety of angles. Measurements are taken and analyzed, and as a result, tuning is done.
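As a quick illustration of the OS-level monitoring mentioned above, here is a minimal Python sketch that samples a few hardware resources during a test run and logs them alongside your load tests. It assumes the third-party psutil package is installed, and the interval and file name are invented; it is a convenience script, not a substitute for top, vmstat, iostat or PerfMon.

import csv
import time

import psutil  # third-party package, assumed installed (pip install psutil)

def sample_resources(duration_sec=60, interval_sec=5, out_file="resources.csv"):
    """Write periodic CPU, memory, swap and disk I/O samples to a CSV file."""
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_pct", "mem_pct", "swap_pct",
                         "disk_read_mb", "disk_write_mb"])
        end = time.time() + duration_sec
        while time.time() < end:
            cpu = psutil.cpu_percent(interval=interval_sec)  # blocks for interval_sec
            mem = psutil.virtual_memory().percent
            swap = psutil.swap_memory().percent
            disk = psutil.disk_io_counters()
            writer.writerow([time.time(), cpu, mem, swap,
                             disk.read_bytes / 2**20, disk.write_bytes / 2**20])

if __name__ == "__main__":
    sample_resources()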

However, testers also take a black-box approach in running the load tests against the system under test. For a Web application, testers will use tools that simulate concurrent users/HTTP connections and measure response times. Some lightweight open source tools I've used in the past for this purpose are ab, siege, httperf. A more heavyweight tool I haven't used yet is OpenSTA. I also haven't used The Grinder yet, but it is high on my TODO list.
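To give a sense of what these tools do under the hood, here is a minimal sketch of a black-box load test using only the Python standard library. The URL, number of simulated users and requests per user are made-up values; for real work you would still reach for ab, siege, httperf or a similar tool.

import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/"   # hypothetical system under test
CONCURRENT_USERS = 10            # simulated concurrent users (invented figure)
REQUESTS_PER_USER = 20

def one_user(n_requests):
    """Issue n_requests sequential GETs and return their response times in seconds."""
    times = []
    for _ in range(n_requests):
        start = time.perf_counter()
        with urllib.request.urlopen(URL) as resp:
            resp.read()
        times.append(time.perf_counter() - start)
    return times

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
        results = pool.map(one_user, [REQUESTS_PER_USER] * CONCURRENT_USERS)
    all_times = [t for user_times in results for t in user_times]
    print(f"requests: {len(all_times)}")
    print(f"mean response time: {statistics.mean(all_times):.3f}s")
    print(f"95th percentile:    {sorted(all_times)[int(0.95 * len(all_times))]:.3f}s")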

When the results of the load test indicate that the performance of the system does not meet its expected goals, it is time for tuning, starting with the application and the database. You want to make sure your code runs as efficiently as possible and your database is optimized on a given OS/hardware configuration. TDD practitioners will find a framework such as Mike Clark's jUnitPerf very useful in this context; it enhances existing unit test code with load test and timed test functionality. Once a particular function or method has been profiled and tuned, developers can wrap its unit tests in jUnitPerf and ensure that it keeps meeting its performance requirements for load and timing. Mike Clark calls this "continuous performance testing". I should also mention that I've done an initial port of jUnitPerf to Python -- I called it pyUnitPerf.
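To give a flavour of the idea without reproducing the real jUnitPerf or pyUnitPerf API (the decorator names below are invented for illustration), here is a rough Python sketch that wraps an existing unit test with a timing constraint and a concurrent-load constraint.

import time
import unittest
from concurrent.futures import ThreadPoolExecutor

def timed(max_seconds):
    """Fail the wrapped test if it takes longer than max_seconds."""
    def decorator(test_func):
        def wrapper(self):
            start = time.perf_counter()
            test_func(self)
            elapsed = time.perf_counter() - start
            self.assertLessEqual(elapsed, max_seconds,
                                 f"took {elapsed:.3f}s, limit {max_seconds}s")
        return wrapper
    return decorator

def under_load(users):
    """Run the wrapped test body concurrently from several threads."""
    def decorator(test_func):
        def wrapper(self):
            with ThreadPoolExecutor(max_workers=users) as pool:
                futures = [pool.submit(test_func, self) for _ in range(users)]
                for f in futures:
                    f.result()  # re-raise any assertion failure from a worker
        return wrapper
    return decorator

class SearchTest(unittest.TestCase):
    @timed(max_seconds=0.5)
    @under_load(users=10)
    def test_search_is_fast_under_load(self):
        sorted(range(10000))  # placeholder for the code being profiled and tuned

if __name__ == "__main__":
    unittest.main()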

If, after tuning the application and the database, the system still doesn't meet its expected performance goals, a wide array of tuning procedures is available at all the levels discussed before. Here are some examples of things you can do to enhance the performance of a Web application outside of the application code per se:
  • Use Web cache mechanisms, such as the one provided by Squid
  • Publish highly-requested Web pages statically, so that they don't hit the database
  • Scale the Web server farm horizontally via load balancing
  • Scale the database servers horizontally and split them into read/write servers and read-only servers, then load balance the read-only servers
  • Scale the Web and database servers vertically, by adding more hardware resources (CPU, RAM, disks)
  • Increase the available network bandwidth
Performance tuning can sometimes be more art than science, due to the sheer complexity of the systems involved in a modern Web application. Care must be taken to modify one variable at a time and redo the measurements, otherwise multiple changes can have subtle interactions that are hard to qualify and repeat.

In a standard test environment such as a test lab, it will not always be possible to replicate the production server configuration. In such cases, a staging environment is used which is a subset of the production environment. The expected performance of the system needs to be scaled down accordingly.

The cycle "run load test->measure performance->tune system" is repeated until the system under test achieves the expected levels of performance. At this point, testers have a baseline for how the system behaves under normal conditions. This baseline can then be used in regression tests to gauge how well a new version of the software performs.

Another common goal of performance testing is to establish benchmark numbers for the system under test. There are many industry-standard benchmarks, such as the ones published by the TPC, and many hardware/software vendors will fine-tune their systems in such ways as to obtain a high ranking in the TPC top tens. It is common knowledge that one needs to be wary of any performance claims that do not include a detailed specification of all the hardware and software configurations that were used in that particular test.

Load testing
We have already seen load testing as part of the process of performance testing and tuning. In that context, it meant constantly increasing the load on the system via automated tools. For a Web application, the load is defined in terms of concurrent users or HTTP connections.

In the testing literature, the term "load testing" is usually defined as the process of exercising the system under test by feeding it the largest tasks it can operate with. Load testing is sometimes called volume testing, or longevity/endurance testing.

Examples of volume testing:
  • testing a word processor by editing a very large document
  • testing a printer by sending it a very large job
  • testing a mail server with thousands of user mailboxes
  • a specific case of volume testing is zero-volume testing, where the system is fed empty tasks
Examples of longevity/endurance testing:
  • testing a client-server application by running the client in a loop against the server over an extended period of time
Goals of load testing:
  • expose bugs that do not surface in cursory testing, such as memory management bugs, memory leaks, buffer overflows, etc.
  • ensure that the application meets the performance baseline established during performance testing. This is done by running regression tests against the application at a specified maximum load.
Although performance testing and load testing can seem similar, their goals are different. On one hand, performance testing uses load testing techniques and tools for measurement and benchmarking purposes and uses various load levels. On the other hand, load testing operates at a predefined load level, usually the highest load that the system can accept while still functioning properly. Note that load testing does not aim to break the system by overwhelming it, but instead tries to keep the system constantly humming like a well-oiled machine.

In the context of load testing, I want to emphasize the extreme importance of having large datasets available for testing. In my experience, many important bugs simply do not surface unless you deal with very large entities: thousands of users in repositories such as LDAP/NIS/Active Directory, thousands of mail server mailboxes, multi-gigabyte tables in databases, deep file/directory hierarchies on file systems, etc. Testers obviously need automated tools to generate these large data sets, but fortunately any scripting language worth its salt will do the job.
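As an example of the kind of throwaway script I mean, here is a sketch that generates a few thousand fake user accounts as LDIF entries that could then be imported into an LDAP server. The base DN, attribute values and user count are invented for illustration.

# Generate a large LDIF file with fake user entries for load testing an LDAP server.
# The base DN and attribute values below are purely illustrative.

NUM_USERS = 10000
BASE_DN = "ou=people,dc=example,dc=com"

with open("test_users.ldif", "w") as f:
    for i in range(NUM_USERS):
        uid = f"testuser{i:05d}"
        f.write(f"dn: uid={uid},{BASE_DN}\n")
        f.write("objectClass: inetOrgPerson\n")
        f.write(f"uid: {uid}\n")
        f.write(f"cn: Test User {i}\n")
        f.write(f"sn: User{i}\n")
        f.write(f"mail: {uid}@example.com\n")
        f.write("\n")

print(f"wrote {NUM_USERS} entries to test_users.ldif")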

Stress testing

Stress testing tries to break the system under test by overwhelming its resources or by taking resources away from it (in which case it is sometimes called negative testing). The main purpose behind this madness is to make sure that the system fails and recovers gracefully -- this quality is known as recoverability.

Where performance testing demands a controlled environment and repeatable measurements, stress testing joyfully induces chaos and unpredictability. To take again the example of a Web application, here are some ways in which stress can be applied to the system:
  • double the baseline number for concurrent users/HTTP connections
  • randomly shut down and restart ports on the network switches/routers that connect the servers (via SNMP commands for example)
  • take the database offline, then restart it
  • rebuild a RAID array while the system is running
  • run processes that consume resources (CPU, memory, disk, network) on the Web and database servers
I'm sure devious testers can enhance this list with their favorite ways of breaking systems. However, stress testing does not break the system purely for the pleasure of breaking it, but instead it allows testers to observe how the system reacts to failure. Does it save its state or does it crash suddenly? Does it just hang and freeze or does it fail gracefully? On restart, is it able to recover from the last good state? Does it print out meaningful error messages to the user, or does it merely display incomprehensible hex codes? Is the security of the system compromised because of unexpected failures? And the list goes on.
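The last item on the list above, running resource-consuming processes, is easy to script. Here is a crude sketch that burns CPU on every core and grabs a chunk of memory for a while; the durations and sizes are arbitrary, and on a real system you would size them (and add disk and network hogs) to match the resources you want to starve.

import multiprocessing
import time

BURN_SECONDS = 120        # how long to keep the stress up (arbitrary)
MEMORY_MB = 512           # how much memory the memory hog grabs (arbitrary)

def cpu_hog(seconds):
    """Spin in a tight loop to keep one core busy."""
    end = time.time() + seconds
    while time.time() < end:
        pass

def memory_hog(megabytes, seconds):
    """Allocate a large block of memory and hold on to it."""
    block = bytearray(megabytes * 2**20)  # keep a reference so it is not freed
    time.sleep(seconds)
    del block

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=cpu_hog, args=(BURN_SECONDS,))
             for _ in range(multiprocessing.cpu_count())]
    procs.append(multiprocessing.Process(target=memory_hog, args=(MEMORY_MB, BURN_SECONDS)))
    for p in procs:
        p.start()
    for p in procs:
        p.join()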

Conclusion

I am aware that I only scratched the surface in terms of issues, tools and techniques that deserve to be mentioned in the context of performance, load and stress testing. I personally find the topic of performance testing and tuning particularly rich and interesting, and I intend to post more articles on this subject in the future.

Web load testing (article 2) - be generous with pacing


This is the second article explaining a practical approach to
avoiding misunderstandings concerning concurrent user load.

If you read the first article, you will be aware of my obsession with achieving even pacing for each test user. Here I will explain how to design tests to achieve even pacing.

First, a digression on thinking time and system time.

Most load testing tools allow the user thinking time to be simulated. In fact, most allow thinking time to be recorded during a user session so it can be replayed during the load test. I always include thinking time in the web session. This ensures that the total duration of each user session roughly corresponds with reality. This is important where the web system maintains sessions for users – you want the number of concurrent sessions to be realistic.

I also like to use transaction timers to measure ALL activity that is not user thinking time. In other words, I use timers to measure all system activity. Not surprisingly, I call this “system time”.

I estimate the minimum system time by running the web test script for a small number of users and adding up the times of all transactions. I multiply the system time by three to allow for slow-down under load, and add the thinking time. Then I add a comfortable margin for error, and end up with a pacing time that should be achievable even under heavy load.
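In other words, the calculation looks roughly like this; all of the figures below are invented for illustration.

# Rough pacing calculation; the numbers are invented, not from a real test.
thinking_time = 120          # seconds of recorded user thinking time per session
measured_system_time = 30    # seconds of system time measured at low load
slowdown_factor = 3          # allow for slow-down under heavy load
margin = 60                  # comfortable margin for error, in seconds

pacing = measured_system_time * slowdown_factor + thinking_time + margin
print(f"pacing per user session: {pacing} seconds")   # 270 seconds here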

So there is my tip – allow plenty of time for each user session. Be generous with your pacing. I do everything I can to ensure there is a pause between user sessions so pacing is maintained.

The astute amongst my readers will have worked out that being generous with pacing implies a larger number of test users. Let me explain. To achieve a specified load (say 720 user sessions started per hour), a pacing of 5 minutes requires 60 test users, whereas a pacing of 10 minutes requires 120 test users. You should take this into account during test planning.
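The arithmetic behind those figures is simple enough to sketch (720 sessions per hour is the figure used above):

# Number of test users needed to hit a target session rate at a given pacing.
sessions_per_hour = 720

def users_needed(pacing_minutes):
    sessions_per_user_per_hour = 60 / pacing_minutes
    return sessions_per_hour / sessions_per_user_per_hour

print(users_needed(5))    # 60.0 test users
print(users_needed(10))   # 120.0 test users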

Web load testing (article 1) - how to keep the load steady


I recently came across a series of articles written in 2001 by Alberto Savoia which impressed me very much. If you search for these titles, you can still find them:
Web load test planning
Trade secrets from a web testing expert
Web page response time 101

The second of Savoia's articles covers three main topics:
Misunderstanding concurrent users
Miscalculating user abandonment
Over-averaging page response times

When I read these I was interested (and somewhat relieved!) to find that much of what Savoia recommends aligns with my own approach to web performance testing.

In this article, I’ll outline a practical approach to how I deal with avoiding misunderstandings concerning concurrent user load.

Savoia rightly pointed out that when defining load for a web performance test, the starting point should be “number of user sessions started per hour”. (It matters less how long each of these individually takes from start to end, though as I will point out in a subsequent article, it cannot be completely ignored.)

Most of the well known load testing tools allow for “pacing” of a test user. You can arrange for a test user to repeat the same session with the start times for each session spaced apart at the “pacing” interval.

It is tempting to ignore this, and perhaps try to disable pacing so as to start each session immediately after the previous one has completed. Believe me, this is almost always not a good thing to do. The reason is essentially quite simple: as there are variations in the time each user session will actually take (particularly under load), the rate at which user sessions start will be uneven, and you will be unable to explain to anyone after the test what load you actually applied.

So keen am I to ensure that the load applied by a test user is even, that I employ a trick taught to me by a seasoned load testing professional (you know who you are Neil!) to measure the pacing achieved in the test. All decent load testing tools allow you to time transactions by inserting a “start” and “end” in a test script. All you do is start a transaction and immediately end it, at the very start of your script processing loop. This creates a transaction whose duration is always zero. At first sight this does not appear very useful. However the time interval between these transactions should align with the pacing interval. After the test, you can extract the transaction data for the test users, pop them in an Excel file and add a column which just uses a simple subtraction to calculate the intervals between the transactions. This should match the defined pacing.
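Instead of (or in addition to) Excel, the same check is easy to script. The sketch below assumes you have exported the zero-duration transaction start times into a CSV with columns user and start_time (in seconds); the file name, column layout, expected pacing and tolerance are assumptions for illustration, not any particular tool's export format.

import csv
from collections import defaultdict

EXPECTED_PACING = 300   # seconds; the pacing defined in the test (assumed)
TOLERANCE = 5           # acceptable drift in seconds (assumed)

starts = defaultdict(list)
with open("pacing_transactions.csv") as f:            # assumed export file
    for row in csv.DictReader(f):                     # columns: user, start_time
        starts[row["user"]].append(float(row["start_time"]))

for user, times in starts.items():
    times.sort()
    intervals = [b - a for a, b in zip(times, times[1:])]
    off = [i for i in intervals if abs(i - EXPECTED_PACING) > TOLERANCE]
    print(f"{user}: {len(intervals)} intervals, {len(off)} outside +/-{TOLERANCE}s of pacing")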

Another trick to watch out for in the same area concerns the ramp-up and ramp-down graphs you will often see generated by load testing tools. I like to make each test user perform an exact number of sessions, and using the pacing I can predict the time by which the last one should have ended. If just one of the sessions took longer than expected, you will see some unevenness during the ramp-down. If all the sessions take longer, all test users are affected and you will see the ramp-down delayed. I have seen this in real life, and it always points to either unexpected behaviour in the web system being tested or an incorrect setup of the pacing.

Let me know what you think of this approach and whether you have any ideas for future articles in the same area.

Zombie Testing

    Mindlessly executing a program with the intent of filling spreadsheets with useless data, writing thick boilerplate bug reports, and meeting company metrics. Finding bugs is a secondary pursuit. Any requirements verification should be done to the letter; there's no reason to check the intent of the requirements.

It’s true that zombies walk among us!  Your fellow software testers might be zombie testers!  But how do you know?  I’m going to lay out some indicators that might out your colleague as a zombie:

  • They act with mindless disregard toward their job (Passive not Proactive).
Testers that have turned display a mindless disregard for their job. They do not think; their brains are not buzzing with activity! They are not analyzing requirements or looking for ambiguities. They are not listening to conversations about new software implementations. They are not engaged in investigating bugs to find the root cause. They don't communicate their unsubstantiated concerns and hunches to the developers or the project managers. Zombie testers never pick up on disconnects between technical people and customers, nor would they intervene to increase understanding anyway. They particularly like it when test cases are written by a QA manager and merely executed by them. Less thinking = better!

  • They protect the status quo, rejecting new methods of testing.
Not only are their minds soft, but they will seek to eat the brains of all who challenge the group think. They stifle creativity. They have a "not invented here" mentality. Zombie testers will continue to look for bugs using the same heuristics, and they will continue to believe that they are being effective. Anyone who questions this status quo will be dealt with severely. Past successes will be used to give credence to their methods; they will point to the relative stability of the group's code base. They will reject radical new testing paradigms like Exploratory Testing in favor of set, predetermined test cases. Using their super lethargic strength, they will crush all dissenters until they are assimilated as zombie testers too. Stay clear! Their complacency is overpowering!

  • They don’t question developers, other testers, managers, authority figures, or process; they just march slowly and aimlessly in stride with everyone else.
A tester that doesn't challenge anyone or anything displays traits of potential zombification. Zombie testers will never go against software developers; they will assume that since the developer is an eccentric genius, they must know what they are doing. After all, complicated and over-engineered code is better, right? When the developer makes statements like "I know what the business wants more than they do", a zombie tester will agree that this technical person knows the business better than the business people. Zombie testers won't disagree with other testers that have more experience than them; after all, these senior testers have written more tests than them. These senior testers also protect the status quo, which is great! Zombie testers won't disagree with their supervisors, even if their supervisor comes in with some crappy half-baked idea that he learned at a conference. They won't question the guru testers that have written the books. Never mind the fact that some of those grey-haired gurus haven't tested since COBOL was en vogue.

  • They place too much faith in automated testing.
To a testing zombie, automated tests are the be-all and end-all! Every test requires a hammer, and our automated test platform is a big-ass hammer. Mindlessly crank these things out. It's easy! Just model each test on the countless examples that are already in the massive suites. There's something so therapeutic about not having to think while writing test scripts. It's like watching TV. These zombies love to push the button on the suite and watch as all their tests come up green. They are all green, so that must mean that the tests are all valid, cover the code correctly, and actually test what we want to test. If a tester starts to question the integrity of the testing harness or decides to manually check something, he or she is probably not a zombie. He or she is probably a healthy tester doing their job.
  • They are more proud of the pretty documentation than the actual bugs they found.
One time, Bill had an interview with a tester candidate who was truly excited about the organization and prettiness of her bug reports. He told me that in his head, he was thinking, "NO! I don't care about that. I want you to find bugs, not write reports!" If your testers really start to glow and brag about their documentation, then they might have already turned. If they are overly critical about the way you write up your bugs or insist that you use boilerplate sheets, then be standoffish. They might be feasting on your brains at any moment! Tester zombies are slow, but they can be fast to protect the status quo. If your tester is not excited about the bugs they found or the innovative approaches they took to finding them, be very concerned! Note: zombie testers tend to value metrics, methods, procedures, and systems more than they value making a difference in the quality of the software. If they constantly quote material from the ISTQB certification test, then analyze how pedantic their tone is. The more pedantic their tone, the more they've turned.

Main attributes of test automation



Below are some of the attributes of test automation that can be measured:

Maintainability
  • Definition: The effort needed to update the test automation suites for each new release.
  • Possible measurements: For example, the average work effort in hours to update a test suite.
Reliability
  • Definition: The accuracy and repeatability of your test automation.
  • Possible measurements: The number of times a test failed due to defects in the tests or in the test scripts.
Flexibility
  • Definition: The ease of working with all the different kinds of automation test ware.
  • Possible measurements: The time and effort needed to identify, locate, restore, combine and execute the different test automation test ware.
Efficiency
  • Definition: The total cost related to the effort needed for the automation.
  • Possible measurements: Monitoring over time the total cost of automated testing, i.e. resources, material, etc.
Portability
  • Definition: The ability of the automated tests to run on different environments.
  • Possible measurements: The effort and time needed to set up and run test automation in a new environment.
Robustness
  • Definition: The effectiveness of automation on an unstable or rapidly changing system.
  • Possible measurements: The number of tests failed due to unexpected events.
Usability
  • Definition: The extent to which automation can be used by different types of users (developers, non-technical people, other users, etc.).
  • Possible measurements: The time needed to train users to become confident and productive with test automation.
Measurements may be quite different from project to project, and one cannot know what is best unless one has clearly understood the objectives of the project.
For example, for software that is regularly changing, with frequent releases on many platforms, the important attributes will be ease of maintaining the tests and, of course, portability.
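As a trivial example of turning one of these attributes into a number, here is a sketch that computes the maintainability measurement above (average work effort in hours to update a suite per release) from hand-recorded figures; the releases, suite names and hours are invented.

# Hours spent updating each automation suite per release (invented figures).
maintenance_hours = {
    "release 1.1": {"smoke suite": 4, "regression suite": 12},
    "release 1.2": {"smoke suite": 2, "regression suite": 20},
    "release 1.3": {"smoke suite": 3, "regression suite": 16},
}

for release, suites in maintenance_hours.items():
    avg = sum(suites.values()) / len(suites)
    print(f"{release}: average {avg:.1f} hours per suite")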

Difference between Smoke testing and Sanity Testing

Sanity Testing:

1. Sanity testing is done after smoke testing, to check that the minimum functionality of the application meets the requirements.
2. It tests the build to confirm that the basic GUI functionality is consistent with the requirement specifications, i.e. whether the links work properly and whether the UI is acceptable.
3. Sanity testing is a very basic check to see if all software components compile with each other without a problem. This is just to make sure that developers have not defined conflicting or multiple functions or global variable definitions.


Smoke testing: 
 
1. It is a quick, ad hoc pass over the build to check the basic functionality.
2. After receiving the build from the development team, the basic and generally identified test cases are run. If these test cases pass without any issues, the further testing effort is planned; if they fail, the build is sent back to the development team.

But finally, these two are almost the same, except that smoke testing is usually done by developers and sanity testing by testers.

After installation testing, we do smoke testing to check whether the build is ready for testing. Then we do sanity testing to check whether the minimum functionality is working. If even the minimum functionality fails, the build can be rejected and sent back to development at this stage.

What is Six Sigma


Six Sigma refers to a philosophy, goal, or methodology used to reduce waste and improve the quality, cost and time performance of any business. Sigma is a Greek letter used to indicate the amount of variation or defect level in a product.
A typical company today might be performing at the three sigma level, meaning they are experiencing one defect out of 16 opportunities. This would equate to about 67,000 defects per million opportunities. A better company might be at the four sigma level or one defect per 160 opportunities. Not bad, but still over 6,000 errors per million.
A performance level of six sigma is equal to 3.4 defects per million opportunities.
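The figures above follow the common Six Sigma convention of a 1.5-sigma shift. Assuming that convention, the defect rates can be reproduced with a few lines of Python; this is an illustrative sketch of the standard formula, not an official Six Sigma calculator.

import math

def dpmo(sigma_level, shift=1.5):
    """Defects per million opportunities for a given sigma level (1.5-sigma shift convention)."""
    z = sigma_level - shift
    tail = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail probability of the normal distribution
    return tail * 1_000_000

for level in (3, 4, 6):
    print(f"{level} sigma: {dpmo(level):,.1f} DPMO")
# 3 sigma: ~66,807 DPMO, 4 sigma: ~6,210 DPMO, 6 sigma: ~3.4 DPMO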

The Testing Estimation Process


One of the most difficult and critical activities in IT is the estimation process. I believe that this is because when we say that a project will be accomplished in a given time at a given cost, it must happen. If it does not, several things may follow: from peers' comments and senior management's warnings to being fired, depending on the reasons and the seriousness of the failure.

Before even thinking of moving to the systems test group at my organization, I always heard from the development group members that the estimates made by the systems test group were too long and too expensive. Then, when I arrived at my new seat, I tried to understand the testing estimation process.

The testing estimation process in place was quite simple.  The inputs for the process, provided by the development team, were: the size of the development team and the number of working days needed for building a solution before starting systems tests.

The testing estimation process said that the number of testing engineers would be half the number of development engineers, and that the number of testing working days would be one third of the number of development working days.

A spreadsheet was created to work out the estimates and calculate the test duration and testing costs, based on the following formulas:

Testing working days = (Development working days) / 3

Testing engineers = (Development engineers) / 2

Testing costs = Testing working days * Testing engineers * person daily cost

As the process was only playing with numbers, it was not necessary to register anywhere how the estimation was obtained.

To exemplify how the process worked: if a development team said that delivering a solution for systems testing would need 4 engineers and 66 working days, then the systems test would need 2 engineers (half) and 22 working days (one third). So the solution would be ready for delivery to the customer after 88 (66+22) working days.
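Encoded directly, the old process was nothing more than this; the person daily cost is an invented figure, used only to make the sketch runnable.

import math

def old_estimate(dev_engineers, dev_working_days, person_daily_cost=500):  # cost is invented
    test_days = math.ceil(dev_working_days / 3)
    test_engineers = math.ceil(dev_engineers / 2)
    cost = test_days * test_engineers * person_daily_cost
    return test_engineers, test_days, cost

engineers, days, cost = old_estimate(4, 66)
print(engineers, days, cost)            # 2 engineers, 22 working days
print("total delivery:", 66 + days)     # 88 working days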

Just to be clear, the testing time did not include the time for developing the test cases and preparing the testing environment. Normally, that would take the testing team an extra 10 working days.


The new testing estimation process

Besides being simple, that process worked fine for different projects and for years. But I was not happy with this approach, and my officemates from the development group were not, either. Metrics, project analogies, expertise, requirements: nothing was being used to support the estimation process.

I mentioned my thoughts to the testing group. We could not stand by the estimation process for much longer, and I myself was no longer convinced it was worth supporting. So some rules were put in place in order to establish a new process.

Those rules are shared below. I know that they are not complete, and it was not my intention to define a full estimation methodology, but from now on I have strong arguments to back up my estimates when someone doubts my numbers.

The Rules

1st Rule: Estimation shall always be based on the software requirements

All estimation should be based on what would be tested, i.e., the software requirements.

Normally, the software requirements were established by the development team alone, with little or no participation from the testing team. After the specification has been established and the project costs and duration have been estimated, the development team asks how long it would take to test the solution, and expects the answer almost right away.

The software requirements shall therefore be read and understood by the testing team, too. Without the testing team's participation, no serious estimation can be made.

2nd Rule: Estimation shall be based on expert judgment

Before estimating, the testing team classifies the requirements into the following categories:
  • Critical: the development team has little knowledge of how to implement it;
  • High: the development team has good knowledge of how to implement it, but it is not an easy task;
  • Normal: the development team has good knowledge of how to implement it.

The experts in each requirement should say how long it would take to test it. The categories help the experts estimate the effort needed to test the requirements.

3rd Rule: Estimation shall be based on previous projects

All estimation should be based on previous projects. If a new project has requirements similar to those of a previous one, the estimation is based on that project.

4th Rule: Estimation shall be based on metrics

My organization has created an OPD (Organization Process Database) where project metrics are recorded. We have three years' worth of metrics, obtained from dozens of projects.

The number of requirements is the basic input for estimating a testing project. From it, my organization has metrics that guide us in estimating a testing project. The table below shows the metrics used; they assume a team size of one testing engineer.


  #  Metric                                             Value
  1  Number of testcases created for each requirement    4.53
  2  Number of testcases developed per working day      14.47
  3  Number of testcases executed per working day       10.20
  4  Number of ARs per testcase                          0.77
  5  Number of ARs verified per working day             24.64

For instance, if we have a project with 70 functional requirements and a testing team size of 2 engineers, we reach the following estimates:

  Metric                                      Value
  Number of testcases – based on metric 1     317.10
  Preparation phase – based on metric 2       11 working days
  Execution phase – based on metric 3         16 working days
  Number of ARs – based on metric 4           244 ARs
  Regression phase – based on metric 5        6 working days

The testing execution is estimated at 22 (16+6) working days, plus 11 working days for preparation.
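For readers who want to reproduce the arithmetic, here is a sketch of the metric-based estimate for the 70-requirement, 2-engineer example above. Exact day counts depend on the rounding convention, so the figures may differ by a day or so from the table.

import math

METRICS = {
    "testcases_per_requirement": 4.53,
    "testcases_developed_per_day": 14.47,
    "testcases_executed_per_day": 10.20,
    "ars_per_testcase": 0.77,
    "ars_verified_per_day": 24.64,
}

def estimate(requirements, engineers):
    """Return testcase count, AR count and phase durations (days) for a testing project."""
    testcases = requirements * METRICS["testcases_per_requirement"]
    ars = testcases * METRICS["ars_per_testcase"]
    prep = math.ceil(testcases / METRICS["testcases_developed_per_day"] / engineers)
    execution = math.ceil(testcases / METRICS["testcases_executed_per_day"] / engineers)
    regression = math.ceil(ars / METRICS["ars_verified_per_day"] / engineers)
    return testcases, ars, prep, execution, regression

tc, ars, prep, ex, reg = estimate(requirements=70, engineers=2)
print(f"testcases: {tc:.1f}, ARs: {ars:.0f}")
print(f"preparation: {prep} days, execution: {ex} days, regression: {reg} days")
# roughly: 317.1 testcases, 244 ARs, 11 days preparation, 16 days execution, 5 days regression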

5th Rule: Estimation shall never forget the past

I have not thrown the past away. The testing team continues using the old process and the spreadsheet. After the estimation is done following the new rules, the testing team estimates again using the old process in order to compare the two results.

Normally, the results from the new estimation process are about 20 to 25% cheaper and faster than the old one. If the testing team gets a different percentage, it goes back over the process to understand whether something was missed.

6th Rule: Estimation shall be recorded

All decisions should be recorded. This is very important because if the requirements change for any reason, the records will help the testing team estimate again. The testing team will not need to go back through all the steps and make the same decisions again. Sometimes it is also an opportunity to adjust the estimation made earlier.

7th Rule: Estimation shall be supported by tools

A new spreadsheet has been created containing metrics that help us reach an estimation quickly. The spreadsheet automatically calculates the costs and duration for each testing phase.

There is also a letter template containing sections such as a cost table, risks, and free notes to be filled out. This letter is sent to the customer. It also shows the different testing options, which helps the customer decide which kind of test he needs.

8th Rule: Estimation shall always be verified

Finally, all estimations should be verified. I've created another spreadsheet for recording them. Each estimation is compared to the previous ones recorded in the spreadsheet to see whether they follow a similar trend. If an estimation deviates from the recorded ones, it should be redone.