Telemetry API: Load & Stress Testing
This page describes the plan for the load and stress tests of the Telemetry API, their execution, and the results.
Motivation
The Telemetry API is a component that will be receiving requests from every product that uses the Telemetry Monitor directly or indirectly (through the Controller). This includes (but is not limited to):
RapidScan
VBUC
SnowConvert (all "flavors")
SnowSpark
These products are widely used and, combined, have many daily executions. That is why our team is very concerned about the performance of this API.
Test Plan
Introduction
We gathered information about the execution of these tools. For instance, we have these numbers:
RapidScan has been executed as many as 75 times in a single day
SnowConvert (Teradata) has been executed as many as 200 times in a single day (based on data from August 2021)
SnowConvert (Oracle) has been executed as many as 78 times in a single day (based on data from August 2021)
SnowConvert (Transact) has been executed as many as 116 times in a single day (based on data from August 2021)
Based on this data, we created the following worst case scenario for a day:
RapidScan will be executed 75 times
SnowConvert (Teradata) will be executed 200 times
SnowConvert (Oracle) will be executed 78 times
SnowConvert (Transact) will be executed 116 times
Adding these numbers gives 469 executions on the same day. If we double this number (to account for requests from other products), we get 938 executions a day. If we assume all these executions take place within an 8-hour span*, we get 120 executions per hour (rounding up), which translates into 30 executions every 15 minutes.
The average elapsed time for SnowConvert executions (taken from the data that has been uploaded to the Assessment DB) is 5 minutes (rounding up).
A request to the Telemetry API is made every 30 seconds. This means the average execution (5 minutes) will result in 10 requests.
An average execution of RapidScan reports 9 events, and the number of events did not seem to significantly affect the time a request took to complete.
*This assumption is based on the fact that many of the requests we receive fit the schedule of a person working 8 hours a day in a timezone similar to UTC-6. It is also a worst case: if the executions were evenly distributed over all 24 hours of a day, it would be less likely for the API to receive two simultaneous requests.
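The arithmetic above can be checked with a short script; Python is used here purely for illustration:

```python
import math

# Worst-case daily execution counts taken from the observed data above.
executions = {
    "RapidScan": 75,
    "SnowConvert (Teradata)": 200,
    "SnowConvert (Oracle)": 78,
    "SnowConvert (Transact)": 116,
}

daily = sum(executions.values())       # 469 executions on the same day
daily_doubled = daily * 2              # 938, margin for other products

# Spread over an 8-hour span: 938 / 8 = 117.25, rounded up to 120 per hour.
per_hour = math.ceil(daily_doubled / 8 / 10) * 10
per_quarter_hour = per_hour // 4       # 30 executions every 15 minutes

print(daily, daily_doubled, per_hour, per_quarter_hour)  # 469 938 120 30
```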
Summary of important metrics
30 executions in 15 minutes
The average execution time is 5 minutes
The average number of requests for an execution is 10
At most 45 executions in 15 minutes
Using a Poisson distribution we can tell that, if the average is 30 executions in 15 minutes, there is a 99.6% chance that there will be no more than 45 executions in 15 minutes. Read more about this calculation in Appendix A.
Test Cases
With these numbers, we can derive multiple scenarios for testing. An explanation for each case follows the table.
| Test ID | Executions | Number of Requests | Time (seconds) between requests** | Number of Events per Request | API Method | Ramp Up |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 45 | 20 | 10 | 25 | POST Events | 2s |
| 2 | 5 | 5 | 10 | 25 | POST Events | 2s |
| 3 | 45 | 20 | 10 | 25 | POST Exceptions | 2s |
| 4 | 20 | 40 | 10 | 100 | POST Events | 2s |
| 5 | 20 | 40 | 10 | 100 | POST Exceptions | 2s |
| 6 | 2 | 5 | 30 | 2000 | POST Exceptions | 2s |
**Time between two requests of the same client
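Under the simplifying assumptions that response times are negligible compared to the time between requests and that the ramp-up starts one new client every 2 seconds, the approximate duration and total request count of each test case can be derived from the table:

```python
# (executions, requests per execution, seconds between requests, ramp-up step)
cases = {
    1: (45, 20, 10, 2),
    2: (5, 5, 10, 2),
    3: (45, 20, 10, 2),
    4: (20, 40, 10, 2),
    5: (20, 40, 10, 2),
    6: (2, 5, 30, 2),
}

results = {}
for test_id, (clients, requests, interval, ramp) in cases.items():
    total_requests = clients * requests
    # The last client starts after (clients - 1) * ramp seconds and then
    # issues `requests` requests spaced `interval` seconds apart.
    duration_s = (clients - 1) * ramp + requests * interval
    results[test_id] = (total_requests, duration_s)
    print(f"Test {test_id}: {total_requests} requests in ~{duration_s} s")
```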
Test Case #1 (Stress Testing)
This case was designed around:
The max number of executions: 45
20 requests (10 minutes execution)
10 seconds between requests (30 seconds would be the standard, but that would make the test take too long; since less time between requests means more stress, we took the liberty of reducing it to 10 seconds, a third of the real interval).
We will test the POST Event method of the API. Test Case #3 tests the exceptions with the same parameters.
This is actually more traffic than expected in the worst case, since this test takes less than 5 minutes while the 45-execution limit was derived for a 15-minute time span.
Test Case #2
This is a more relaxed version of Test Case #1, included mostly so we can compare a heavy load with a smaller one.
Test Case #3 (Stress Testing)
Identical to Test Case #1, but with the POST Exception method of the API.
Test Case #4 (Load Testing)
This case was designed around:
20 executions, which is slightly more traffic than expected, since this test takes less than 7 minutes
40 requests (modeling long executions that take 20 minutes; in our assessment database, only two executions out of 9943 at the time of writing took more than 20 minutes, and those took less than 24 minutes).
10 seconds between requests (30 seconds would be the standard, but that would make the test take too long; since less time between requests means more stress, we took the liberty of reducing it to 10 seconds, a third of the real interval).
We will test the POST Event method of the API. Test Case #5 tests the exceptions with the same parameters.
Test Case #5 (Load Testing)
Identical to Test Case #4, but with the POST Exception method of the API.
Test Case #6
This case was designed around requests with a large number of events/exceptions. Since exceptions include more information than events (and occupy more bytes), this test was performed only with the POST Exception method of the API. The objective of this test was to show that, because of the implementation of the Telemetry API, there is no big difference between processing a large number of events (2000) and a small number (25).
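To give a sense of the payload sizes involved, the sketch below builds small and large batches using a made-up event shape (the real Telemetry API schema is not shown in this document, so the field names here are illustrative only):

```python
import json

def make_exception(i: int) -> dict:
    # Hypothetical exception payload; field names are illustrative only.
    return {
        "id": i,
        "type": "exception",
        "message": f"Sample exception {i}",
        "stackTrace": "at Module.Method() in File.cs:line 42",
    }

small_batch = json.dumps([make_exception(i) for i in range(25)])
large_batch = json.dumps([make_exception(i) for i in range(2000)])

# A 2000-item batch is roughly 80x the size of a 25-item batch.
print(len(small_batch), len(large_batch))
```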
Execution
We executed these tests with Apache JMeter. The test plan file is attached here (it can be opened with Apache JMeter; please do not run it without authorization):
You can enable/disable the different nodes in the left panel to choose which test to run and in which environment to run it. You can also tweak the Thread Group configuration to choose the number of clients (Number of Threads) and the number of requests (Loop Count). The Sleep Action must be modified to change the time between two requests of the same client.
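For unattended runs (for example, from a future Azure DevOps pipeline), JMeter can also be run in non-GUI mode; the file names below are placeholders, not the actual attachment name:

```shell
# -n: non-GUI mode, -t: test plan, -l: results log, -e/-o: generate HTML report
jmeter -n -t telemetry-load-test.jmx -l results.jtl -e -o report/
```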
The results of an execution are shown when clicking on the three bottom nodes. Graph Results shows a graph with throughput, average response time, standard deviation of the response time, and more data over time. View Results Tree shows every request performed during the test. Summary Report shows aggregate data for the executions (average response time across all requests, maximum response time, etc.).
Results
Average, Min, Max and Std Dev for the response time of all requests in each test case (in milliseconds)
| Test ID | Average | Min | Max | Std Dev |
| --- | --- | --- | --- | --- |
| 1 | 15637 | 276 | 50692 | 11112.2 |
| 2 | 743 | 228 | 1470 | 305.3 |
| 3 | 7557 | 197 | 42441 | 9899.46 |
| 4 | 11957 | 276 | 89196 | 12190.2 |
| 5 | 6448 | 325 | 37868 | 6861.92 |
| 6 | 2030 | 2030 | 2030 | 0 |
Overall performance
Even in the worst scenarios, the average time for a request to complete was about 15 seconds
In a more relaxed scenario such as Test Case #2, requests took less than 1 second on average to complete, and at most one and a half seconds
The worst request took almost 90 seconds to complete
There were no errors; all the data was uploaded correctly and is consistent between the database and the Application Insights resource.
Behavior of the API during the worst scenarios
In these graphs:
The average response time is represented by the blue line
The median response time is represented by the purple line
These graphs show that the first requests took longer to complete. This is probably because the ramp-up time was 2 seconds, meaning a new "user" appeared every 2 seconds, and the API needed some time to balance the load correctly. In both cases, the response time stabilizes after the first "batch" of requests.
Conclusions
The average and max response times are good enough for their respective scenarios, considering that:
Once a request is sent, the Telemetry Monitor does not need to wait for it to complete. Currently it does wait, but on a separate thread, so the request does not block the execution of the tool.
All of the requests were processed correctly and the integrity of the data was preserved without any errors.
The performance of the API can be improved by adding a queue to process the different jobs (such as the one used in the Assessment API)
With a queue, the Telemetry Monitor would only wait for the server to acknowledge that the data is queued for processing, which would radically improve the response time of the API.
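A minimal sketch of that queue-based design, assuming a generic in-process queue and worker (the real implementation would use whatever queueing infrastructure the Assessment API relies on):

```python
import queue
import threading
import time

jobs: queue.Queue = queue.Queue()
processed = []

def worker() -> None:
    # Background worker: does the slow processing outside the request path.
    while True:
        payload = jobs.get()
        if payload is None:           # sentinel to stop the worker
            break
        time.sleep(0.01)              # stand-in for DB writes, etc.
        processed.append(payload)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def post_events(payload: dict) -> dict:
    # What the POST Events handler would do: enqueue and acknowledge at once,
    # so the client-visible latency is only the enqueue time.
    jobs.put(payload)
    return {"status": "accepted"}

ack = post_events({"events": [1, 2, 3]})
jobs.join()                           # demo only: wait for the worker to finish
print(ack, len(processed))
```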
The database can handle at most 60 concurrent connections. This limit will be increased for production.
Appendix A: Calculating the max number of executions in a 15-minute time span
Calculations
The cumulative distribution function of the Poisson distribution takes two parameters, k and lambda, and tells us the probability of the event "X happens at most k times in a time interval", given that X happens lambda times on average in the same interval:
P(X ≤ k) = e^(−lambda) · Σ_{i=0}^{k} lambda^i / i!
Using lambda = 30 and k = 45, we get the value 0.996. This indicates that there is a 99.6% chance that there will be no more than 45 new executions in a 15-minute time span, given that the average number of new executions in a 15-minute span is 30 (lambda).
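This calculation can be reproduced with a few lines of standard-library Python (`scipy.stats.poisson.cdf(45, 30)` gives the same value):

```python
import math

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), summing the PMF term by term."""
    term = math.exp(-lam)        # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i          # P(X = i) from P(X = i - 1)
        total += term
    return total

print(round(poisson_cdf(45, 30), 3))  # 0.996
```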
Reasoning
The Poisson distribution is typically used to determine the probability of an event happening k times during a time interval, and it works best when the events are independent. We also accounted for some considerations, such as the possibility of an 8-hour span in which the number of requests is much higher than during the rest of the day. The event we chose to analyze was the first request performed by a session.
Apache JMeter has a number of Timers we can use to simulate the arrival of new events. However, we did not want to commit to a specific tool, since we are considering automating these tests with Azure DevOps in the future.