[dss_bench] Tool to generate automatic graphs for q/s based on various parameters#1519
the-glu wants to merge 1 commit into
| scopes: list[str] = [] | ||
| default: bool = True |
name is probably sufficiently self-documenting, but I'm not sure what these are from inspection and this is a base class that will be used in (presumably) a number of places -- let's document what these are.
| try: | ||
| test.setup(session, base_url) | ||
| except Exception: |
This will prevent even the user from cancelling execution with KeyboardInterrupt; it seems like we should be much narrower in the exceptions we catch. What exceptions would we want to accept and continue for here? Wouldn't we expect the setup to work, and want to stop a test as probably invalid if the setup wasn't successful?
It doesn't prevent the user from cancelling execution: KeyboardInterrupt is not a subclass of Exception (it derives from BaseException).
>>> import time
>>> try:
...     time.sleep(200)
... except Exception as e:
...     print("Caught")
...
^CTraceback (most recent call last):
  File "<python-input-2>", line 2, in <module>
    time.sleep(200)
    ~~~~~~~~~~^^^^^
KeyboardInterrupt
However, yes, letting the test run when setup fails is probably wrong, so I switched to an early return.
I left the broad catch around the teardown, however: a failure there is probably less of an issue, especially since the datastores are reset every time. Is that OK?
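For illustration, the behavior discussed in this thread (Ctrl-C still propagates, setup failure aborts early, teardown failure is tolerated) could look roughly like the sketch below. `run_once` and the logger name are hypothetical; only the `setup`/`action`/`teardown` method names are taken from the diff, and the body is an assumption rather than the PR's actual code.

```python
import logging

logger = logging.getLogger("dss_bench_sketch")  # hypothetical logger name

# KeyboardInterrupt derives from BaseException, not Exception,
# so `except Exception` lets Ctrl-C propagate.
assert not issubclass(KeyboardInterrupt, Exception)
assert issubclass(KeyboardInterrupt, BaseException)


def run_once(test, session, base_url) -> bool:
    """Return True if the action ran, False if setup failed (hypothetical helper)."""
    try:
        test.setup(session, base_url)
    except Exception:
        # A failed setup probably invalidates the test: stop here.
        logger.exception("setup failed; skipping test")
        return False
    try:
        test.action(session, base_url)
    finally:
        try:
            test.teardown(session, base_url)
        except Exception:
            # Tolerated: the datastores are reset between data points anyway.
            logger.warning("teardown failed; ignoring", exc_info=True)
    return True
```

Note that an exception raised by `action` still propagates here after the teardown attempt; whether it should be counted as an error instead is a separate design choice.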
| test.action(session, base_url) | ||
| latencies_ms.append((time.monotonic() - t0) * 1000.0) | ||
| done += 1 | ||
| except Exception: |
This seems like an overbroad catch; could we just use query_and_describe to catch the right exceptions in the right circumstances and then check whether the query succeeded?
We could restrict the catch, but the idea is to keep it broad so that it also catches other potential errors (wrong data returned, etc.).
query_and_describe also does much more than simple queries (including potential retries), and I don't think we want that in the testing case. The idea is to issue simple queries (like the other load tests), not to use the "full" query framework.
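As a rough sketch of the "simple queries with a broad catch" idea: the loop below times each call and counts any `Exception` as an error, without retries. `measure` and its parameters are hypothetical names; only `latencies_ms` and the `time.monotonic()` timing come from the diff.

```python
import time
from typing import Callable


def measure(action: Callable[[], None], iterations: int) -> tuple[list[float], int]:
    """Time `action` repeatedly; return (success latencies in ms, error count)."""
    latencies_ms: list[float] = []
    errors = 0
    for _ in range(iterations):
        t0 = time.monotonic()
        try:
            action()
        except Exception:
            # Broad on purpose: HTTP failures, bad payloads, assertion errors...
            # KeyboardInterrupt is not an Exception, so Ctrl-C still works.
            errors += 1
            continue
        latencies_ms.append((time.monotonic() - t0) * 1000.0)
    return latencies_ms, errors
```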
| def run_test( | ||
| test: BenchTest, targets: list[tuple[str, str]], cfg: GlobalConfig |
It's hard to figure out what "targets" is without tracing through the code; let's just make a simple data structure so it's super clear:
@dataclass
class Target:
    base_url: str
    audience: str
| test: BenchTest, targets: list[tuple[str, str]], cfg: GlobalConfig | |
| test: BenchTest, targets: list[Target], cfg: GlobalConfig |
...but, it doesn't seem like carrying audience is even necessary since it's a function of the base URL (using an AuthAdapter/UTMClientSession will take care of this automatically).
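A minimal sketch of that second point, under the assumption that the audience is simply the hostname of the base URL (this is an assumption made for illustration; the real derivation lives in the auth adapter and is not shown in this PR):

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Target:
    """One DSS instance to benchmark (hypothetical shape)."""

    base_url: str

    @property
    def audience(self) -> str:
        # Assumption: audience == hostname of the base URL.
        return urlparse(self.base_url).hostname or ""
```

With this shape, `run_test` would only need `list[Target]`, and callers would never have to pass an audience explicitly.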
| # survivorship bias of percentiles computed over successes only. | ||
| with_errors = merged + merged_errors | ||
| return { |
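To illustrate the survivorship-bias comment in the hunk above: if failed requests are dropped before computing percentiles, slow failures (timeouts, for example) vanish from the tail. The sketch below uses a hand-written nearest-rank percentile and made-up numbers; it is not the PR's implementation.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; returns NaN for an empty list."""
    if not values:
        return math.nan
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


ok_ms = [10.0, 12.0, 11.0, 13.0]  # made-up latencies of successful requests
err_ms = [5000.0]                 # made-up latency of a request that failed

p95_ok = percentile(ok_ms, 95)            # successes only
p95_all = percentile(ok_ms + err_ms, 95)  # successes and errors together
print(p95_ok, p95_all)  # 13.0 5000.0
```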
Follows #1518
This PR adds a new tool to generate meaningful graphs to compare the performance of various scenarios.
As of now, we do have Locust tests. They serve some purposes (mainly variations over time), but using them to validate performance can be time-consuming and prone to error. We also have a tendency to use various, incompatible parameters between tests.
An extra consideration is the fact that CockroachDB data is distributed differently between every run, meaning that tests with NUM_USS and NUM_NODE greater than one must average performance across every DSS, not just the first one.
The framework proposed here aims to measure performance as a single point: no change over time, and in theory, each test cleans up after itself. Example: a test that creates and deletes a single operational intent (included here as an example).
Then, we add a variant, which represents the X-axis of our graphs. These could be multiple; for example: the number of existing subscriptions, or the number of workers. This PR includes an inter-USS latency context as an example.
Finally, an option is available to compare different images or different datastores, with the idea of doing comparisons (for example, in a PR against master, or to compare performance between datastores, which will be needed for Raft).
The framework automatically cleans up and runs 'start-locally' for every data point, then produces a graph. A JSON file is also stored for future use.
The test is executed against all DSS at the same time and averaged.
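One way to picture "executed against all DSS at the same time and averaged" is the sketch below. `average_qps` and `bench_one` are hypothetical names, and a thread pool is an assumption about how the concurrency could be done, not a description of the PR's code.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable


def average_qps(targets: list[str], bench_one: Callable[[str], float]) -> float:
    """Run bench_one against every target concurrently and average the results.

    `targets` must be non-empty.
    """
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        # pool.map preserves input order and re-raises worker exceptions.
        per_target = list(pool.map(bench_one, targets))
    return mean(per_target)
```

Averaging over every instance matters because CockroachDB distributes data differently on each run, so a single DSS may be unrepresentatively fast or slow.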
Example graph with latest version:
This allows us to generate useful graphs, like this one showing how latency heavily impacts queries as simple as RID operational intents:
(⚠️ This graph has been generated before displaying errors)
Another example comparing the current master and the latest release on RID:
(⚠️ This graph has been generated before displaying errors)
This shows small variations (at least in terms of QPS), probably explained by the fact that I ran it on my machine while other processes were running. Note that tests should probably be run on a dedicated machine, as free from external influences as possible. The graphs shown here are for demonstration only.
Note that a run can take a significant amount of time, especially with database initialization at high latencies.
This PR is a first step; the goal is to add more tests and variants in future PRs, especially a RID ISA test with one subscription and one SCD test (based on flightinsubs).