Telemetry API: Load & Stress Testing (Updated 16/11/2022)
This page describes the plan for the load and stress tests of the Telemetry API, their execution, and the results.
Motivation
The Telemetry API is a component that will be receiving requests from every product that uses the Telemetry Monitor directly or indirectly (through the Controller). This includes (but is not limited to):
RapidScan
VBUC
SnowConvert (all "flavors")
SnowSpark
BlackDiamond Studio
These products are widely used and, combined, have many daily executions. That is why our team is very concerned about the performance of this API.
Test Plan
Introduction
We gathered information about the execution of these tools. For instance, we have these numbers:
RapidScan has been executed as many as 15 times in a single day
SnowConvert (Teradata) has been executed as many as 446 times in a single day (looking at the data from November 2022)
SnowConvert (Oracle) has been executed as many as 339 times in a single day (looking at the data from November 2022)
SnowConvert (Transact) has been executed as many as 373 times in a single day (looking at the data from November 2022)
SparkSnow Convert has been executed as many as 76 times in a single day (looking at the data from November 2022)
BDS Portal has been executed as many as 460 times in a single day (looking at the data from November 2022)
BDS IDE has been executed as many as 183 times in a single day (looking at the data from November 2022)
Based on this data, we created the following worst-case scenario for a single day:
RapidScan will be executed 15 times
SnowConvert (Teradata) will be executed 446 times
SnowConvert (Oracle) will be executed 339 times
SnowConvert (Transact) will be executed 373 times
SparkSnow Convert will be executed 76 times
BDS Portal will be executed 460 times
BDS IDE will be executed 183 times
After adding those numbers, we get 1892 executions in a single day. If we assume that all of these executions take place during an 8 hour span, we would have 237 executions per hour (rounding up), which translates into 60 executions every 15 minutes (rounding up).
The average elapsed time for executions (taken from the data that has been uploaded to the Assessment DB) is 28 minutes (rounding up).
*This assumption is based on the fact that many of the requests we receive seem to fit the schedule of a person working 8 hours a day in a timezone similar to UTC-6. This assumption is also a worst case: if the executions were evenly distributed across all 24 hours of the day, it would be less likely for the API to receive two simultaneous requests.
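For reference, the arithmetic above can be reproduced with a short script (the numbers are the ones listed in this section):

```python
import math

# Worst-case daily executions per product, as listed above.
daily_executions = {
    "RapidScan": 15,
    "SnowConvert (Teradata)": 446,
    "SnowConvert (Oracle)": 339,
    "SnowConvert (Transact)": 373,
    "SparkSnow Convert": 76,
    "BDS Portal": 460,
    "BDS IDE": 183,
}

total = sum(daily_executions.values())   # 1892 executions per day
per_hour = math.ceil(total / 8)          # 237, assuming an 8 hour span
per_15_min = math.ceil(per_hour / 4)     # 60

print(total, per_hour, per_15_min)  # 1892 237 60
```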
Summary of important metrics
60 executions in 15 minutes
The average execution time is 28 minutes
The average number of requests per execution is 3, obtained from the calculation on the Excel sheet
At most 80 executions in 15 minutes
Using a Poisson distribution we can tell that, if the average is 60 executions in 15 minutes, there is a 99.2% chance that there won't be more than 79 executions (i.e., fewer than 80) in 15 minutes. Read more about this calculation in Appendix A.
Test Plan
With these numbers, we can extract multiple scenarios for testing. An explanation for each case is included further in this document.
Test ID | Executions | Number of Requests | Number of Events per Request | API Method | Total Time |
---|---|---|---|---|---|
1 | 239 | 3 | 5 | POST Events | 3600s |
2 | 318 | 3 | 5 | POST Events | 3600s |
3 | 2390 | 3 | 5 | POST Events | 3600s |
Total Time is the expected time to complete all executions and requests, but you can use the trick from the fourth section of the "Telemetry API: Executing the Load & Stress Test" page to run an equivalent test in just 15 minutes.
The following table contains the 15-minute tests equivalent to the tests in the previous table:
Test ID | Threads (users) | Ramp Up | Loop Count |
---|---|---|---|
1 | 180 | 900s | 1 |
2 | 239 | 900s | 1 |
3 | 1793 | 900s | 1 |
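The thread counts in the table above appear to follow from the hourly executions: a 15-minute window covers a quarter of the hourly executions, and each execution issues 3 requests on average. A small sketch of that derivation (the helper name is ours, not JMeter's):

```python
import math

def jmeter_threads(executions_per_hour, requests_per_execution=3):
    """Thread count for a 15-minute equivalent test: a quarter of the
    hourly executions, each contributing requests_per_execution requests,
    rounded up."""
    return math.ceil(executions_per_hour * requests_per_execution / 4)

for executions in (239, 318, 2390):
    print(executions, jmeter_threads(executions))
# 239 -> 180, 318 -> 239, 2390 -> 1793
```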
Test Case #1 (Stress Testing)
This case was designed to test the maximum number of users we have seen as of 16/11/2022:
The max number of executions in 15 minutes: 60
The average number of requests per execution: 3
We will test the POST Event method of the API.
Test Case #2 (Max Stress Test)
This case was designed around what happens if we suddenly get 33.33% more users, so it is the same as test 1 but with 33.33% more executions.
This test will tell us how the Telemetry API behaves with 318 executions per hour. Based on the data collected up to today (16/11/2022), and assuming the average stays at 60 executions every 15 minutes, we can say with 99.2% confidence that we will not exceed 318 executions per hour.
Test Case #3
Similarly to test 2, this test is a modification of the first test, with the difference that we want to test the performance at its maximum level by increasing the executions per hour by a factor of 10. As we saw before, there is only a 0.78% chance that we suddenly get a 33% increase in executions, but we want to make sure the performance holds up in all possible scenarios.
Execution
We executed these tests with Apache JMeter. The test plan file is attached here (you can open it with Apache JMeter; please do not run it without authorization):
You can enable/disable the different nodes in the left panel to choose which test you will run, and in which environment you will run it. You can also tweak the Thread Group configuration to choose the number of clients (Number of Threads) and the number of requests (Loop Count). The Sleep Action must be modified to change the time between two requests from the same client.
The results of an execution are shown when clicking on the three bottom nodes. Graph Results shows a graph indicating throughput, average response time, standard deviation of the response time, and more data (over time). View Results Tree shows every request that was performed during the test. Summary Report shows summary data for the executions (average response time for all requests, max response time for all requests, etc.).
Results
Average, Min, Max and Std Dev for the response time of all requests in each test case (in milliseconds)
Test ID | Average | Min | Max | Std Dev | Error % |
---|---|---|---|---|---|
1 | 599 | 374 | 5313 | 389.31 | 0 |
2 | 491 | 363 | 1403 | 141.91 | 0 |
3 | 548 | 334 | 6055 | 395.90 | 0 |
Overall performance
Even in the worst scenarios, the average time for a request to complete was less than a second.
The worst request took 6 seconds to complete.
With fewer than 318 executions per hour (which represents a 33.33% increase over our historical maximum), there were no errors. And as we said earlier, the probability that the number of executions per hour stays under 318 is 99.2%.
Even with 2390 executions per hour (which represents 10 times our historical maximum), there were no errors.
Behavior of the API during the worst scenarios
In these graphs:
The average response time is represented by the blue line
The median response time is represented by the purple line
These graphs show that the first requests took longer to complete; the API presumably needed some time to balance the load correctly. The response time stabilizes after the first "batch" of requests. In the execution of test #3, we can see that the requests stabilized even further.
Conclusions
The average and max response times for each test are good enough for their respective scenarios, considering that:
Once the request is sent, the Telemetry Monitor does not have to wait for its completion. Currently, it does wait, but this happens on another thread, so the request is non-blocking for the execution of the tool.
All of the requests were processed correctly and the integrity of the data was preserved without any errors.
The performance of the API can be improved by adding a queue to process the different jobs (such as the one that is used in the Assessment API)
This would mean the Telemetry Monitor can wait for the server's response, and the server will only confirm that the data is being processed. This would radically improve the response time of the API.
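As an illustration only (this is not the Assessment API's actual design, and all names here are ours), the queue-based idea amounts to acknowledging the request immediately while a background worker does the processing:

```python
import queue
import threading

# In-memory job queue; a production setup would use a durable queue service.
jobs = queue.Queue()

def handle_post_events(payload):
    """Accept a telemetry request immediately; processing happens later."""
    jobs.put(payload)
    # 202 Accepted: the server only says the data is being processed.
    return {"status": 202, "message": "data is being processed"}

def worker():
    # Background consumer: persists events at its own pace, so the
    # client-facing response time no longer depends on processing time.
    while True:
        payload = jobs.get()
        if payload is None:  # sentinel to stop the worker
            break
        # ... persist the events to the telemetry store here ...
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

response = handle_post_events({"events": [{"name": "ToolStarted"}]})
print(response["status"])  # 202
```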
Appendix A: Calculating the max number of executions for a 15-minute time span
Calculations
The Cumulative Distribution Function for the Poisson Distribution takes two parameters, k and lambda:
F(k; lambda) = e^(-lambda) * Σ_{i=0..k} lambda^i / i!
It tells us the probability of the following event: "the event will happen at most k times in an interval of time", given that the same event happens lambda times (on average) in the same interval of time.
Using lambda = 60 and k = 79, we get the value 0.992. This indicates that there is a 99.2% chance that there won't be more than 79 new executions in a 15-minute time span, given that the average number of new executions in a 15-minute time span is 60 (lambda).
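This value can be reproduced with a short script that sums the Poisson probability mass function term by term (pure standard library; the function name is ours):

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)   # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        total += term
    return total

p = poisson_cdf(79, 60)
print(round(p, 3))  # approximately 0.992
```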
Reasoning
The Poisson distribution is usually used to determine the probability of an event happening k times during a time interval. It works best when the events are independent. We also took some considerations into account (such as the possibility of an 8-hour span in which the number of requests was much higher than during the rest of the day). The event we chose to analyze was the first request performed by a session.
Apache JMeter has a number of Timers we can use to simulate the arrival of new events. However, we did not want to commit to a specific tool, since we are considering automating these tests using Azure DevOps in the future.