Telemetry API: Load & Stress Testing
This page describes the plan for the load and stress tests of the Telemetry API, their execution, and the results.
Motivation
The Telemetry API is a component that will be receiving requests from every product that uses the Telemetry Monitor directly or indirectly (through the Controller). This includes (but is not limited to):
RapidScan
VBUC
SnowConvert (all "flavors")
SnowSpark
These products are widely used and, combined, have many daily executions. That is why our team is very concerned about the performance of this API.
Test Plan
Introduction
We gathered information about the execution of these tools. For instance, we have these numbers:
RapidScan has been executed as many as 75 times in a single day
SnowConvert (Teradata) has been executed as many as 200 times in a single day (based on data from August 2021)
SnowConvert (Oracle) has been executed as many as 78 times in a single day (based on data from August 2021)
SnowConvert (Transact) has been executed as many as 116 times in a single day (based on data from August 2021)
Based on this data, we created the following worst case scenario for a day:
RapidScan will be executed 75 times
SnowConvert (Teradata) will be executed 200 times
SnowConvert (Oracle) will be executed 78 times
SnowConvert (Transact) will be executed 116 times
Adding these numbers gives 469 executions on the same day. If we double this number (to account for requests from other products), we get 938 executions a day. If we assume all these executions take place within an 8-hour span*, we get 120 executions per hour (rounding up), which translates into 30 executions every 15 minutes.
The average elapsed time for SnowConvert executions (taken from the data that has been uploaded to the Assessment DB) is 5 minutes (rounding up).
A request to the Telemetry API is made every 30 seconds. This means the average execution (5 minutes) will result in 10 requests.
An average execution of RapidScan reports 9 events, and the number of events did not seem to significantly affect the time a request took to complete.
*This assumption is based on the fact that many of the requests we receive fit the schedule of a person working 8 hours a day in a timezone similar to UTC-6. It is also a worst case: if the executions were evenly distributed over all 24 hours of a day, it would be less likely for the API to receive two simultaneous requests.
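The arithmetic above can be checked with a short script; Python is used here purely for illustration:

```python
import math

# Worst-case daily execution counts taken from the observed data above.
executions = {
    "RapidScan": 75,
    "SnowConvert (Teradata)": 200,
    "SnowConvert (Oracle)": 78,
    "SnowConvert (Transact)": 116,
}

daily = sum(executions.values())       # 469 executions on the same day
daily_doubled = daily * 2              # 938, margin for other products

# Spread over an 8-hour span: 938 / 8 = 117.25, rounded up to 120 per hour.
per_hour = math.ceil(daily_doubled / 8 / 10) * 10
per_quarter_hour = per_hour // 4       # 30 executions every 15 minutes

print(daily, daily_doubled, per_hour, per_quarter_hour)  # 469 938 120 30
```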
Summary of important metrics
30 executions in 15 minutes
The average execution time is 5 minutes
The average number of requests for an execution is 10
At most 45 executions in 15 minutes
Using a Poisson distribution we can tell that, if the average is 30 executions in 15 minutes, there is a 99.6% chance that there will be no more than 45 executions in 15 minutes. Read more about this calculation in Appendix A.
Test Cases
With these numbers, we can derive multiple scenarios for testing. An explanation for each case follows the table.
| Test ID | Executions | Number of Requests | Time (seconds) between requests** | Number of Events per Request | API Method | Ramp Up |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 45 | 20 | 10 | 25 | POST Events | 2s |
| 2 | 5 | 5 | 10 | 25 | POST Events | 2s |
| 3 | 45 | 20 | 10 | 25 | POST Exceptions | 2s |
| 4 | 20 | 40 | 10 | 100 | POST Events | 2s |
| 5 | 20 | 40 | 10 | 100 | POST Exceptions | 2s |
| 6 | 2 | 5 | 30 | 2000 | POST Exceptions | 2s |
**Time between two requests of the same client
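Under the simplifying assumptions that response times are negligible compared to the time between requests and that the ramp-up starts one new client every 2 seconds, the approximate duration and total request count of each test case can be derived from the table:

```python
# (executions, requests per execution, seconds between requests, ramp-up step)
cases = {
    1: (45, 20, 10, 2),
    2: (5, 5, 10, 2),
    3: (45, 20, 10, 2),
    4: (20, 40, 10, 2),
    5: (20, 40, 10, 2),
    6: (2, 5, 30, 2),
}

results = {}
for test_id, (clients, requests, interval, ramp) in cases.items():
    total_requests = clients * requests
    # The last client starts after (clients - 1) * ramp seconds and then
    # issues `requests` requests spaced `interval` seconds apart.
    duration_s = (clients - 1) * ramp + requests * interval
    results[test_id] = (total_requests, duration_s)
    print(f"Test {test_id}: {total_requests} requests in ~{duration_s} s")
```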
Test Case #1 (Stress Testing)
This case was designed around:
The max number of executions: 45
20 requests (10 minutes execution)
10 seconds between requests (30 seconds would be the standard, but that would make the test take too long; since less time between requests means more stress, we took the liberty of reducing it to 10 seconds, a third of the real interval).
We will test the POST Event method of the API. Test Case #3 tests the exceptions with the same parameters.
This is actually more traffic than expected in the worst case, since this test takes less than 5 minutes while the 45-execution limit was derived for a 15-minute time span.
Test Case #2
This is a more relaxed version of Test Case #1, included mostly so we can compare a heavy load with a smaller one.
Test Case #3 (Stress Testing)
Identical to Test Case #1, but with the POST Exception method of the API.
Test Case #4 (Load Testing)
This case was designed around:
20 executions, which is slightly more traffic than expected, since this test takes less than 7 minutes
40 requests (modeling long executions that take 20 minutes; in our assessment database, only two executions out of 9943 at the time of writing took more than 20 minutes, and those took less than 24 minutes).
10 seconds between requests (30 seconds would be the standard, but that would make the test take too long; since less time between requests means more stress, we took the liberty of reducing it to 10 seconds, a third of the real interval).
We will test the POST Event method of the API. Test Case #5 tests the exceptions with the same parameters.
Test Case #5 (Load Testing)
Identical to Test Case #4, but with the POST Exception method of the API.
Test Case #6
This case was designed around requests with a large number of events/exceptions. Since exceptions include more information than events (and occupy more bytes), this test was performed only with the POST Exception method of the API. The objective of this test was to show that, because of the implementation of the Telemetry API, there is no big difference between processing a large number of events (2000) and a small number (25).
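To give a sense of the payload sizes involved, the sketch below builds small and large batches using a made-up event shape (the real Telemetry API schema is not shown in this document, so the field names here are illustrative only):

```python
import json

def make_exception(i: int) -> dict:
    # Hypothetical exception payload; field names are illustrative only.
    return {
        "id": i,
        "type": "exception",
        "message": f"Sample exception {i}",
        "stackTrace": "at Module.Method() in File.cs:line 42",
    }

small_batch = json.dumps([make_exception(i) for i in range(25)])
large_batch = json.dumps([make_exception(i) for i in range(2000)])

# A 2000-item batch is roughly 80x the size of a 25-item batch.
print(len(small_batch), len(large_batch))
```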
Execution
We executed these tests with Apache JMeter. The test plan file is attached here (it can be opened with Apache JMeter; please do not run it without authorization):
You can enable/disable the different nodes in the left panel to choose which test to run and in which environment to run it. You can also tweak the Thread Group configuration to choose the number of clients (Number of Threads) and the number of requests (Loop Count). The Sleep Action must be modified to change the time between two requests of the same client.
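For unattended runs (for example, from a future Azure DevOps pipeline), JMeter can also be run in non-GUI mode; the file names below are placeholders, not the actual attachment name:

```shell
# -n: non-GUI mode, -t: test plan, -l: results log, -e/-o: generate HTML report
jmeter -n -t telemetry-load-test.jmx -l results.jtl -e -o report/
```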
The results of an execution are shown when clicking on the three bottom nodes. Graph Results shows a graph with throughput, average response time, standard deviation of the response time, and more data over time. View Results Tree shows every request performed during the test. Summary Report shows aggregate data for the executions (average response time across all requests, maximum response time, etc.).
Results
Average, Min, Max and Std Dev for the response time of all requests in each test case (in milliseconds)
| Test ID | Average | Min | Max | Std Dev |
| --- | --- | --- | --- | --- |
| 1 | 15637 | 276 | 50692 | 11112.2 |
| 2 | 743 | 228 | 1470 | 305.3 |
| 3 | 7557 | 197 | 42441 | 9899.46 |
| 4 | 11957 | 276 | 89196 | 12190.2 |
| 5 | 6448 | 325 | 37868 | 6861.92 |
| 6 | 2030 | 2030 | 2030 | 0 |
Overall performance
Even in the worst scenarios, the average time for a request to complete was about 15 seconds
In a more relaxed scenario such as Test Case #2, requests took less than 1 second on average to complete, and at most one and a half seconds
The worst request took almost 90 seconds to complete
There were no errors; all the data was uploaded correctly and is consistent between the database and the Application Insights resource.
Behavior of the API during the worst scenarios
In these graphs:
The average response time is represented by the blue line
The median response time is represented by the purple line
These graphs show that the first requests took longer to complete. This is probably because the ramp-up time was 2 seconds, meaning a new "user" appeared every 2 seconds, and the API needed some time to balance the load correctly. In both cases, the response time stabilizes after the first "batch" of requests.
Conclusions
The average and max response times are good enough for their respective scenarios, considering that:
Once a request is sent, the Telemetry Monitor does not need to wait for it to complete. Currently it does wait, but on a separate thread, so the request does not block the execution of the tool.
All of the requests were processed correctly and the integrity of the data was preserved without any errors.
The performance of the API can be improved by adding a queue to process the different jobs (such as the one used in the Assessment API)
With a queue, the Telemetry Monitor would only wait for the server to acknowledge that the data is queued for processing, which would radically improve the response time of the API.
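A minimal sketch of that queue-based design, assuming a generic in-process queue and worker (the real implementation would use whatever queueing infrastructure the Assessment API relies on):

```python
import queue
import threading
import time

jobs: queue.Queue = queue.Queue()
processed = []

def worker() -> None:
    # Background worker: does the slow processing outside the request path.
    while True:
        payload = jobs.get()
        if payload is None:           # sentinel to stop the worker
            break
        time.sleep(0.01)              # stand-in for DB writes, etc.
        processed.append(payload)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def post_events(payload: dict) -> dict:
    # What the POST Events handler would do: enqueue and acknowledge at once,
    # so the client-visible latency is only the enqueue time.
    jobs.put(payload)
    return {"status": "accepted"}

ack = post_events({"events": [1, 2, 3]})
jobs.join()                           # demo only: wait for the worker to finish
print(ack, len(processed))
```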
The database can handle at most 60 concurrent connections. This limit will be increased for production.
Appendix A: Calculating the max number of executions in a 15-minute time span
Calculations
The cumulative distribution function of the Poisson distribution takes two parameters, k and lambda, and tells us the probability of the event "X happens at most k times in a time interval", given that X happens lambda times on average in the same interval:
P(X ≤ k) = e^(−lambda) · Σ_{i=0}^{k} lambda^i / i!
Using lambda = 30 and k = 45, we get the value 0.996. This indicates that there is a 99.6% chance that there will be no more than 45 new executions in a 15-minute time span, given that the average number of new executions in a 15-minute span is 30 (lambda).
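This calculation can be reproduced with a few lines of standard-library Python (`scipy.stats.poisson.cdf(45, 30)` gives the same value):

```python
import math

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), summing the PMF term by term."""
    term = math.exp(-lam)        # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i          # P(X = i) from P(X = i - 1)
        total += term
    return total

print(round(poisson_cdf(45, 30), 3))  # 0.996
```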
Reasoning
The Poisson distribution is typically used to determine the probability of an event happening k times during a time interval, and it works best when the events are independent. We also accounted for some considerations, such as the possibility of an 8-hour span in which the number of requests is much higher than during the rest of the day. The event we chose to analyze was the first request performed by a session.
Apache JMeter has a number of Timers we can use to simulate the arrival of new events. However, we did not want to commit to a specific tool, since we are considering automating these tests with Azure DevOps in the future.