A Case Study on Performance Testing of a Distributed Fault Management System
Introduction
This article provides a brief case study of the innovative approaches taken in the design and execution of performance testing for a distributed fault management system in a Unix/Java environment.
The article proposes a Java instrumentation model using the JETM library, followed by an approach for handling the large volumes of time-series data that JETM instrumentation generates. Together, these two techniques help pinpoint code-level performance bottlenecks and provide good decision support for any performance study in a Java environment.
About Performance Testing
Performance testing can be defined as a study to measure which parts of a device or piece of software contribute most to poor performance, or to establish throughput levels (and thresholds) at which acceptable response times are maintained.
To explain performance testing of a distributed system, consider the analogy of a cardiac stress test. In a typical stress test, electrodes wired to an ECG machine are first attached to your chest, and a blood pressure cuff is placed on your arm. Sensors may be placed on different parts of your body to measure the amount of oxygen in your blood. After a baseline ECG is obtained, you are asked to begin a low level of exercise, either by walking on a treadmill or pedaling a stationary bicycle. The exercise is "graded": every three minutes, the level of exercise is increased. At each "stage" of exercise, your pulse, blood pressure, and ECG are recorded, along with any symptoms you may be experiencing. The test can follow different protocols, such as continuing until exhaustion or stopping at a predetermined time or stress level. It provides vital data about your heart and circulatory system and detects cardiac problems that are easily missed while a person is at rest.
Similarly, in a distributed application the overall performance of the system is influenced by bottlenecks in one or more components, and our job is to find those bottlenecks. They may be code-related, hardware-related, or network-related. It is critical for performance testing to understand the system and its performance level, and to make appropriate recommendations to the engineering teams deciding on deployment of the application in production.
So, to determine the performance level of an end-to-end distributed system, it is important to design a performance testing approach along the lines of the above analogy.
In a distributed system, understanding the performance of the end-to-end process requires determining the critical performance measurements of the individual components and of the overall system, such as:
1. Response Time
2. Processing Time
3. Throughput
4. Round Trip Time
All these measurements are collected at different workloads, such as minimum, optimum, and peak load. Typical usage patterns should be identified, and a mix of activities simulated to mimic real usage loads, factored with the load increases expected in the immediate future.
These measurements are in turn influenced by many system-level and application-level performance thresholds, such as CPU usage, heap memory size, and database query processing time.
Performance testing is the process of identifying the application elements, applying load, and monitoring and measuring the performance behavior of the given system. The following diagram provides a framework for handling a performance testing project.
Our Environment
In this distributed Fault Management System, end-user systems are monitored by fault detectors. On a fault event, the event details are transmitted to the service provider network, where the event is checked for validity and automatically forwarded to a service team responsible for working on and closing the issue. The following block diagram shows the end-to-end process of the Fault Management System under discussion.
Challenges Faced in PT Design
This environment runs entirely on the Unix platform. Java and J2EE components are used extensively throughout the end-to-end system.
Conventional tools like LoadRunner or JMeter provide time measurements only at the component level; they cannot provide process-level or function-level measurements. In a fault management system, where the load is intensive and the transmission rate is fast (in milliseconds), it is difficult to make decisions with component-level performance data alone.
Managing time-series data is another challenge once code-level instrumentation is implemented. Instrumented code produces large volumes of time measurements: data arrives at a frequency of seconds, and accumulation can reach several thousand records per hour. Managing and interpreting data of this size is difficult without a proper aggregation mechanism in place.
Solutions
A custom load simulator was used to generate the required volume of fault events. It simulates load across all components and provides component-level time measurements.
Solution for function-level data collection:
This is the challenge of low-level profiling of the system, that is, getting access to function/module-level performance. It can be addressed with a component-based instrumentation approach; the Java JETM library is used for this.
Solution for large time-series data aggregation:
The time-series data generated by JETM calls is very large because of its intensity and frequency. Managing this data to support decisions is also a big challenge, and ordinary database or flat-file approaches do not address it. RRDtool, a round-robin database, was therefore chosen to handle this data processing.
The following sections briefly discuss these solutions.
JETM Instrumentation
The Java Execution Time Measurement Library (JETM) is a free tool for monitoring the execution times of Java code fragments.
JETM provides low-level profiling, so any portion of a code block can be measured for its processing time. This gives a highly flexible white-box approach to performance measurement.
A performance tester need not be a Java expert to understand the application implementation. If the tester gathers sufficient requirements from the engineering team to understand the implications of different code blocks, he or she can embed instrumentation only in those areas and start measuring system performance.
Implementing JETM instrumentation therefore requires architectural and performance-testing understanding more than deep coding skill. The person should, however, be conversant with basic Java tasks such as reading and modifying pieces of code, compiling, and creating a new application bundle.
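JETM's usage pattern is to create a named measurement point just before a code block and collect it just after, typically in a finally clause. The self-contained sketch below mimics that pattern in plain Java so it can run without the JETM jar; the MeasurementDemo, Stats, and Point classes are illustrative stand-ins for this article, not JETM's own classes.

```java
// Minimal stand-in for a JETM-style measurement point: a named point is
// created before a code block and "collected" after it, accumulating
// count, min, max, and total elapsed time per point name.
import java.util.LinkedHashMap;
import java.util.Map;

public class MeasurementDemo {

    static class Stats {
        long count;
        double min = Double.MAX_VALUE, max, total;
        void record(double ms) {
            count++;
            min = Math.min(min, ms);
            max = Math.max(max, ms);
            total += ms;
        }
        double average() { return count == 0 ? 0.0 : total / count; }
    }

    // Aggregated statistics, keyed by measurement point name.
    static final Map<String, Stats> POINTS = new LinkedHashMap<>();

    static class Point {
        final String name;
        final long startNanos = System.nanoTime();
        Point(String name) { this.name = name; }
        void collect() {
            double ms = (System.nanoTime() - startNanos) / 1_000_000.0;
            POINTS.computeIfAbsent(name, k -> new Stats()).record(ms);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            Point p = new Point("Receptor:receivedTrap"); // create point before the block
            try {
                Thread.sleep(5);                          // the code being measured
            } finally {
                p.collect();                              // always collect, even on error
            }
        }
        Stats s = POINTS.get("Receptor:receivedTrap");
        System.out.printf("Receptor:receivedTrap # %d avg %.1f ms%n", s.count, s.average());
    }
}
```

The try/finally shape matters: without it, a thrown exception would leave the point uncollected and skew the counts.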
Sample JETM flat output (times in milliseconds):

| Measurement Point               | #     | Average | Min    | Max       | Total         |
|---------------------------------|-------|---------|--------|-----------|---------------|
| JetmStart:init                  | 1     | 0.000   | 0.000  | 0.000     | 0.000         |
| Receptor:receivedTrap           | 94378 | 1.558   | 1.000  | 539.000   | 147,073.000   |
| SNMP_VERSION_1:run:             | 46728 | 16.218  | 1.000  | 3,573.000 | 757,853.000   |
| sendMessage:                    | 9193  | 37.899  | 22.000 | 3,181.000 | 348,410.000   |
| SNMP_VERSION_2C:run:            | 28826 | 2.186   | 1.000  | 38.000    | 63,020.000    |
| SNMP_VERSION_2C:run:FMA:        | 560   | 1.804   | 1.000  | 11.000    | 1,010.000     |
| SNMP_VERSION_2C:run:SUNMC,ILOM: | 18105 | 74.636  | 54.000 | 4,227.000 | 1,351,284.000 |
| sendMessage:                    | 17620 | 39.607  | 22.000 | 4,194.000 | 697,879.000   |
| sendMessage:                    | 783   | 54.360  | 23.000 | 4,581.000 | 42,564.000    |
| Measurement Point     | #      | Average   | Min     | Max        | Total           |
|-----------------------|--------|-----------|---------|------------|-----------------|
| Dispatcher:getMessage | 161243 | 2,949.753 | 185.000 | 71,042.000 | 475,627,064.000 |
| getMassage:postUrl    | 161243 | 2,607.835 | 1.000   | 60,053.000 | 420,495,175.000 |
| JetmStart:init        | 2      | 0.000     | 0.000   | 0.000      | 0.000           |
| emptyQ                | 23689  | 118.568   | 45.000  | 5,619.000  | 2,808,755.000   |
| run:deletFromQ        | 147053 | 91.482    | 42.000  | 8,875.000  | 13,452,648.000  |
| run:getFromQ          | 161259 | 255.240   | 41.000  | 11,036.000 | 41,159,787.000  |
RRDtool Implementation
RRDtool is a high-performance data logging and graphing system for time-series data. It is an open-source tool with uses in both real-time and non-real-time systems, for applications such as network monitoring and web log analysis.
Its advantage is that the database size does not grow with the amount of data inserted. Instead, the data is aggregated according to time rules given when the database is created, which retains the overall data trend at different time intervals. Graphs can then be generated from this aggregated data. A sample graph from the original RRDtool web page is shown below.
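The aggregation rules are fixed at creation time through the archive (RRA) definitions. A minimal sketch of the RRDtool command-line workflow follows; the data-source name, step, heartbeat, and archive sizes here are illustrative assumptions, not values taken from the system under test.

```shell
# Create an RRD with one GAUGE data source ("resp"), expected every 60 s
# (heartbeat 120 s, any non-negative value). Two archives:
#   - 1-minute averages kept for one day (1440 rows)
#   - 1-hour averages kept for 30 days   (720 rows)
rrdtool create resptime.rrd --step 60 \
  DS:resp:GAUGE:120:0:U \
  RRA:AVERAGE:0.5:1:1440 \
  RRA:AVERAGE:0.5:60:720

# Insert a sample (timestamp "N" = now, value 123) and graph the last hour.
rrdtool update resptime.rrd N:123
rrdtool graph resptime.png --start -1h \
  DEF:r=resptime.rrd:resp:AVERAGE \
  LINE1:r#0000FF:"response time (ms)"
```

Older samples are automatically rolled up into the coarser one-hour archive, which is why the file never grows however long the test runs.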
In an environment with a strong Java presence, rrd4j can be used instead of RRDtool. rrd4j offers the portability of coding in Java with all the features of RRDtool, and can be customized to the needs of the environment.
The time-series data provided by JETM can be passed through RRDtool, which in turn processes the data and produces the graphical images.
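The glue between the two tools can be as simple as parsing a row of JETM's flat text output and emitting an rrdtool update command. The sketch below assumes the column layout of the tables above (point, count, average, min, max, total); the .rrd file-naming convention and the timestamp handling are illustrative assumptions.

```java
import java.util.Locale;

public class JetmToRrd {

    /** Builds an "rrdtool update" command from one JETM table row. */
    static String toUpdateCommand(String row, long epochSeconds) {
        // Columns: | point | count | average | min | max | total |
        String[] cols = row.split("\\|");
        String point = cols[1].trim();
        // JETM prints thousands separators; strip them before parsing.
        double averageMs = Double.parseDouble(cols[3].trim().replace(",", ""));
        // Derive an RRD file name from the point name (hypothetical convention).
        String file = point.replaceAll("[^A-Za-z0-9]", "_") + ".rrd";
        return String.format(Locale.ROOT, "rrdtool update %s %d:%.3f",
                file, epochSeconds, averageMs);
    }

    public static void main(String[] args) {
        String row = "| Receptor:receivedTrap | 94378 | 1.558 | 1.000 | 539.000 | 147,073.000 |";
        System.out.println(toUpdateCommand(row, 1700000000L));
        // prints: rrdtool update Receptor_receivedTrap.rrd 1700000000:1.558
    }
}
```

A small loop over the periodic JETM dumps, piping each generated command to a shell, is enough to keep the round-robin database fed.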
No. of Requests Graph:
Response Time Graph:
Conclusion
In performance testing, measurement is a critical success factor. To measure applications both at a high level and at the component/function level, appropriate tools and techniques must be selected. The JETM instrumentation approach presented here is one such technique, and it can be used effectively in Java-based application environments.
Using RRDtool to manage large volumes of time-series records provides good statistical aggregation of the data while preserving the overall trend. Effective use of the tool, together with its graph generation utility, yields clear reports and enhances decision making at both the engineering and management levels.