docs: add test plan for ranged loop (#5629)

Co-authored-by: littleskunk <jens.heimbuerge@googlemail.com>
# Ranged Loop Testplan
&nbsp;
## Background
This test plan covers the Ranged Loop. It is based on the design doc in
[Confluence](https://storjlabs.atlassian.net/wiki/spaces/ENG/pages/2434531336/Metainfo+Loop+Sharding).
&nbsp;
&nbsp;
| Test Scenario | Test Case | Description | Comments |
|-----------------------------------------------|------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| General | Ranged Loop | It should be a separate satellite command, deployed on a separate machine, that processes the segments table in chunks, with each chunk handled in its own goroutine, essentially a horizontally scaled metainfo loop running in parallel |
| | Observer | Observers should: join the ranged loop in a different way than the segments loop, oriented more toward concurrent work; be registered in the loop at the start of the process; and each have a flag that determines whether they are enabled for the ranged loop |
| | Observer Migration | Every observer migrated to the new ranged loop should be able to be enabled explicitly on either the segments loop or the ranged loop, so that observers can be switched over one by one in production |
| | Garbage Collection w/Ranged Loop | Repair workers are allowed to start early, so with 2 or more workers it should still be possible to run garbage collection with the ranged loop enabled, since the current process of generating bloom filters for garbage collection stays as a separate peer on a separate machine |
| | Testing Ranged Loop | Ranged loop should be integrated with testplanet and storj-up |
| | Workers Finishing at Different Times | Have two or more workers finish at different times: what happens to the workflow? E.g. repair workers are allowed to start early, so if 2 or more workers finish at different times, the results have to be combined. Worst case, one of the workers runs into an error: what happens then? Does the process wait again for a complete run (e.g. the bloom filter requires a complete run of all workers), or does it just start regardless? |
| | Overlap of Ranges | Split a range between 2 workers. Put at least 3 segments into the total range, exactly on the start and end values. Do the ranges overlap? Are the start and end values of a range inclusive or exclusive? (see the range-splitting sketch in the appendix below) |
| Garbage Collection Verify w/Ranged Loop | Creation of Bloom filter | Create a test dataset of pieces and corresponding storage node IDs using pointerdb, then use the dataset to create a bloom filter with the satellite's algorithm. Verify that integrating the bloom filter creation process with the data repair checker loop does not cause any additional overhead for pointerdb | The test makes sure the bloom filter is created correctly and the integration with pointerdb does not add overhead. Also: is the bloom filter creation process still integrated with the data repair checker loop? Reword this test case if that is the case |
| | Push of Bloom filter to storage nodes | Create a test network of storage nodes, then have the satellite push a new bloom filter to each storage node in the network and verify that each node correctly receives its filter | Are there cases where piece failures prevent data considered garbage from being deleted? |
| | Processing of Bloom filter by storage nodes | Add test data to each storage node that should be considered garbage for this test, then have each node process the received bloom filter by checking for pieces that are not included in the filter and are older than the filter's creation date. Verify that the storage nodes correctly identify and delete the garbage pieces | Offline duration? Nodes that are offline for extended periods? Separate tests for that? We probably have some (not checked yet) |
| | Handling of missed pushes (Storage Node Offline) | Create a test network of storage nodes, have some nodes simulate being offline during a push of the bloom filter, and verify that the offline nodes correctly process the next push they receive and delete the correct pieces | Just an e2e of the first three tests to make sure the overall process works |
| | Overall garbage collection process E2E | Add test data to the storage nodes and simulate different scenarios such as data being replaced, data being deleted, or clients stopping payment. Then use the satellite to notify the storage nodes of the garbage pieces and verify that the storage nodes correctly delete the data |
| | Clock accuracy | Create a test network of storage nodes, have some nodes simulate an inaccurate clock, and verify that the storage nodes with inaccurate clocks suffer reputation failures |
| | Early Continuation | GC should not be allowed to continue early, since the bloom filter would then be incomplete and data that is still needed would be deleted |
| | Worker Errors Out | GC should not be allowed to continue if a worker errors out, since the bloom filter would then be incomplete and data that is still needed would be deleted |
| | Concurrency of Ranged Loop | Work results have to be combined, since there is a single bloom filter per storage node (see the bloom filter merge sketch in the appendix below) | Read somewhere that bloom filters can be combined if split into chunks |
| | Test on Saltlake | Test Garbage Collection on saltlake after change to verify that it can handle a large number of pieces and storage nodes | make sure performance increases with ranged loop |
| | Performance | Measure the time it takes for the satellite to create the bloom filter and push it to the storage nodes, the time it takes for a storage node to process the bloom filter, and the memory usage on the satellite and storage nodes |
| Audit Reservoir Sampling Verify w/Ranged Loop | Reservoir Sampling Algorithm | Create a test dataset of segments and corresponding nodes. Use the reservoir sampling algorithm to select segments for each node. Verify that the segments selected for each node are chosen with uniform probability (see the reservoir sampling sketch in the appendix below) |
| | Creation of The Audit Queue | Use a test network of nodes to simulate different scenarios such as new data being added or nodes being vetted. Use the audit observer to create the audit queue and verify that the segments selected for each node are added to the queue in a random order. |
| | Selection Of Segments for Audit | Use the audit workers to select segments from the audit queue and verify that these segments selected for audit are chosen with uniform probability for every node. |
| | Separate Reservoir for Unvetted Nodes | Use a test network of nodes and simulate different scenarios such as new data being added or nodes being vetted. Use the audit observer to create the audit queue and verify that the segments selected for each node are added to the queue in a random order with different reservoir size for unvetted and vetted nodes. | make sure performance increases with ranged loop |
| | Early Continuation | Should not be allowed to continue early: an early start means a storage node would only get audits for the range covered so far, and we want maximum coverage, which waiting for the other workers provides |
| | Worker Errors Out | Should be allowed to continue: if a worker errors out, we simply audit only the ranges that did finish, which is better than no audits at all |
| | Concurrency of Ranged Loop | Need to combine work results, since we want random audits over all segments. |
| | Performance | Measure the time it takes for the audit observer to create the reservoir sampling and the time it takes for the audit workers to select segments from the queue. Measure the memory usage of the system. |
| | Test on Saltlake | Test ARS on saltlake after change to verify that it can handle a large number of segments and nodes. |
| Metrics Verify w/Ranged Loop | Collection of Metrics | Verify that all Storj peers (i.e. satellite, storage node, uplink, etc) are still able to correctly collect the metrics specified in the telemetry client |
| | Storage of Metrics | Verify that metrics are still correctly stored in the specified datastores (postgres, influxDB, rothko) and that they are stored in the correct format and with the correct data |
| | Data Visualization & Accessibility of Metrics | Verify that the metrics data from influxDB are still correctly fed to the grafana dashboard and that the dashboard is able to correctly display the metrics data |
| | Two Segment Loops With Different Ranges Side By Side | Metrics shouldn't be affected, only execution time; the total number of segments would have to be the sum of what the two ranged loops report |
| | Incomplete Metrics | What happens if one of the concurrent workers fails with an error? Are metrics still reported, even in an incomplete state? |
| Graceful Exit Verify w/Ranged Loop | Graceful Exit Piece Gathering | Verify that the satellite is still able to correctly gather pieces that need to be transferred from a storage node for graceful exit when it is initiated |
| | Graceful Exit Failure/Error Prevention | Verify that the protocol for transferring pieces from one storage node to another is still working as expected, including retransmission of pieces in case of errors or failures |
| | Graceful Exit Reports | Verify that the satellite operator is still able to create accurate and complete reports for exited storage nodes and that the reports include all necessary information for releasing escrows |
| | Ungraceful Exit Handling | Verify that the system is still able to handle ungraceful exits correctly, including cases where the storage node doesn't transfer pieces, transfers pieces incorrectly, is too slow to transfer pieces, or decides to terminate the process. |
| | Graceful Exit Failure Handling | Verify that the system is still able to handle failures like network failures, system failures, and other failures that may occur during the graceful exit process. |
| | Graceful Exit Error Handling | Verify that the system is still able to handle errors like piece hash mismatch, error in piece transfer, and other errors that may occur during the graceful exit process. |
| | Early Continuation | Should not be allowed to continue early, otherwise not all pieces would be transferred for graceful exit and some segments could effectively be deleted, in the sense that the node would only have to transfer part of the data |
| | Worker Errors Out | Should not be allowed to continue, for the same reason: otherwise not all pieces would be transferred for graceful exit and some segments could effectively be deleted, same as above |
| | Concurrency of Ranged Loop | Doesn't need to combine work results; it just needs to wait for all graceful exit workers to finish together and error free, since the work results are sent all at once to a DB table. Otherwise a storage node could get away with transferring only part of the data |
| Repair Checker Verify w/Ranged Loop | Check for Missing Pieces | Verify the repair checker can still use the necessary info to do checks for missing pieces and insert segments into injured_segments table |
| | Add Segment to Queue | Verify that repair checker can still add segment to the queue |
| | Ensure Check | Verify that the repair checker is still able to perform a check before a satellite restart |
| | Healthy Segments | Verify that the repair checker can still clean the queue of segments that are deemed healthy |
| | Initiate Repair | Verify that the repair checker still initiates repair when the number of healthy pieces is less than or equal to the repair threshold and greater than or equal to the minimum required pieces of the redundancy scheme |
| | Irreparable Segments | Verify that repair checker can still monitor irreparable segments |
| | Early Continuation | Should not be allowed to continue early in the concurrent run: if a checker that finishes early commits its work results into the database so the repair worker can start, the queue cleanup would delete from the repair queue all segments that this early checker has not yet seen, e.g. every other segment covered by the "slow" repair checkers |
| | Worker Errors Out | Should not be allowed to continue: if the results are committed into the database so the repair worker can start, the queue cleanup would delete from the repair queue all segments that the repair checkers have not seen in the current run, e.g. the segments from the checkers that errored out |
| | Concurrency of Ranged Loop | Doesn't need to combine work results; it just needs to wait for all repair checker workers to finish error free with their segments added to the repair queue, since the repair queue cleanup code should only run after all workers have finished successfully (see the wait-for-all-workers sketch in the appendix below) |
| Storage Node Tally Verify w/Ranged Loop | At Rest Usage | Verify Tally can still calculate data-at-rest usage correctly |
| | Live Accounting 1 | Verify that by the end of a tally iteration the project totals from the live accounting cache, as seen during the metainfo loop, are assigned to latestLiveTotals |
| | Live Accounting 2 | Also verify that the delta between latestLiveTotals and initialLiveTotals is correct with respect to the previous test |
| | Early Continuation | Should not be allowed to continue early, since the results of all workers are required before continuing and a correct tally calculation is based on the time difference between 2 runs |
| | Worker Errors Out | Should not be allowed to continue if one worker errors out, since the results of all workers are required before continuing; we don't want to feed incomplete data into the DB table, and a correct tally calculation is based on the time difference between 2 runs |
| | Concurrency of Ranged Loop | Work results need to be combined before committing to the database, otherwise the tally calculation would be incorrect: unexpected multiple rows in the DB would corrupt the data |
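
## Appendix: Illustrative Sketches

The "Overlap of Ranges" test case depends on whether a range includes or excludes its start and end values. Below is a minimal sketch of that question, assuming half-open `[start, end)` ranges over a plain integer keyspace; the real ranged loop splits the UUID keyspace, and none of the names below come from the satellite code. With half-open ranges, a segment placed exactly on a boundary is claimed by exactly one worker.

```go
package main

import "fmt"

// halfOpenRange is a hypothetical stand-in for the satellite's range type;
// the real ranged loop splits UUID space, but the boundary question is the same.
type halfOpenRange struct {
	Start, End uint64 // Start inclusive, End exclusive
}

func (r halfOpenRange) Contains(v uint64) bool { return v >= r.Start && v < r.End }

// splitRange divides [0, limit) into n contiguous half-open ranges with no overlap.
func splitRange(limit, n uint64) []halfOpenRange {
	ranges := make([]halfOpenRange, 0, n)
	step := limit / n
	var start uint64
	for i := uint64(0); i < n; i++ {
		end := start + step
		if i == n-1 {
			end = limit // last range absorbs the remainder
		}
		ranges = append(ranges, halfOpenRange{Start: start, End: end})
		start = end
	}
	return ranges
}

func main() {
	ranges := splitRange(100, 2) // [0,50) and [50,100)
	// A segment placed exactly on the boundary value 50 must be claimed by
	// exactly one worker; with half-open ranges it belongs to the second one.
	for _, boundary := range []uint64{0, 50, 99} {
		owners := 0
		for _, r := range ranges {
			if r.Contains(boundary) {
				owners++
			}
		}
		fmt.Printf("value %d is covered by %d range(s)\n", boundary, owners)
	}
}
```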
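For the garbage collection "Concurrency of Ranged Loop" case, the per-range work results must end up in a single bloom filter per storage node. The sketch below uses a deliberately simplified toy bloom filter (one hash function, fixed size) that is not the satellite's implementation; it only illustrates the assumption that filters built over different ranges can be merged with a bitwise OR, provided they share the same size and hash functions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// toyBloomFilter is a simplified stand-in for the satellite's bloom filter;
// it exists only to show that per-range filters for the same node can be
// merged when they use the same size and hash functions.
type toyBloomFilter struct {
	bits [256]byte
}

func (f *toyBloomFilter) hash(pieceID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(pieceID))
	return h.Sum32() % uint32(len(f.bits)*8)
}

// Add sets the bit for a piece ID.
func (f *toyBloomFilter) Add(pieceID string) {
	i := f.hash(pieceID)
	f.bits[i/8] |= 1 << (i % 8)
}

// MaybeContains reports whether the piece ID may be in the filter.
func (f *toyBloomFilter) MaybeContains(pieceID string) bool {
	i := f.hash(pieceID)
	return f.bits[i/8]&(1<<(i%8)) != 0
}

// Merge ORs another filter into this one; the union then "maybe contains"
// every piece that either per-range worker saw.
func (f *toyBloomFilter) Merge(other *toyBloomFilter) {
	for i := range f.bits {
		f.bits[i] |= other.bits[i]
	}
}

func main() {
	var workerA, workerB toyBloomFilter
	workerA.Add("piece-from-range-1")
	workerB.Add("piece-from-range-2")

	workerA.Merge(&workerB)
	fmt.Println(workerA.MaybeContains("piece-from-range-1")) // true
	fmt.Println(workerA.MaybeContains("piece-from-range-2")) // true
}
```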
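The audit "Reservoir Sampling Algorithm" case verifies uniform selection probability. Below is a generic sketch of reservoir sampling (Algorithm R) with hypothetical names; the satellite's audit observer keeps one reservoir per node and differs in detail.

```go
package main

import (
	"fmt"
	"math/rand"
)

// reservoir keeps a uniform random sample of up to size items from a stream
// whose length is not known in advance (Algorithm R).
type reservoir struct {
	size  int
	seen  int
	items []string
	rng   *rand.Rand
}

func newReservoir(size int, rng *rand.Rand) *reservoir {
	return &reservoir{size: size, rng: rng}
}

// Sample offers one segment to the reservoir. After n offers, every segment
// has probability size/n of being in the reservoir.
func (r *reservoir) Sample(segment string) {
	r.seen++
	if len(r.items) < r.size {
		r.items = append(r.items, segment)
		return
	}
	// Replace a kept item with probability size/seen.
	if j := r.rng.Intn(r.seen); j < r.size {
		r.items[j] = segment
	}
}

func main() {
	rng := rand.New(rand.NewSource(1))
	res := newReservoir(3, rng)
	for i := 0; i < 100; i++ {
		res.Sample(fmt.Sprintf("segment-%d", i))
	}
	fmt.Println(res.items) // 3 segments, each kept with probability 3/100
}
```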
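Several sections above require that a run may not continue early and that nothing is committed when a worker errors out. The sketch below shows that pattern using `golang.org/x/sync/errgroup`; the function and parameter names are illustrative, not the satellite's API. Partial results are combined and committed only after every worker has finished without error.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// runRangedWorkers processes each range in its own goroutine and collects
// partial results, but commits nothing (e.g. to the repair queue or a bloom
// filter table) unless every worker finishes without error.
func runRangedWorkers(
	ctx context.Context,
	ranges []string,
	process func(context.Context, string) ([]string, error),
	commit func([]string) error,
) error {
	group, ctx := errgroup.WithContext(ctx)
	partials := make([][]string, len(ranges))

	for i, r := range ranges {
		i, r := i, r // capture loop variables
		group.Go(func() error {
			out, err := process(ctx, r)
			if err != nil {
				return err // any failure aborts the whole run
			}
			partials[i] = out
			return nil
		})
	}

	// Wait for every worker; if any errored, commit nothing.
	if err := group.Wait(); err != nil {
		return err
	}

	var combined []string
	for _, p := range partials {
		combined = append(combined, p...)
	}
	return commit(combined)
}

func main() {
	err := runRangedWorkers(context.Background(),
		[]string{"range-1", "range-2"},
		func(ctx context.Context, r string) ([]string, error) {
			return []string{r + "/segment"}, nil
		},
		func(results []string) error {
			fmt.Println("committing", results)
			return nil
		},
	)
	if err != nil {
		fmt.Println("run failed, nothing committed:", err)
	}
}
```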