Mini Cowbell Testplan

 

Background

We want to deploy the entire Storj stack in environments that run Kubernetes on 5 NUCs.

 

Pre-condition

For satellites that have only 5 nodes, the recommended RS scheme is [2,3,4,4], where:

  • 2 is the number of required pieces to reconstitute the segment.
  • 3 is the repair threshold, i.e. if a segment is left with only 3 healthy pieces, it will be repaired.
  • 4 is the success threshold, i.e. the number of pieces required for a successful upload or repair.
  • 4 is the number of total erasure-coded pieces that will be generated.
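
The arithmetic behind these four numbers can be checked with a short sketch. The `RSScheme` class, its field names, and its methods are hypothetical and chosen only for illustration; they are not part of Storj's configuration or API. The sketch also assumes that pieces of a segment always land on distinct nodes and that node selection is uniform.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class RSScheme:
    """Hypothetical container for the four RS numbers (not a Storj type)."""
    required: int  # pieces needed to reconstitute a segment
    repair: int    # repair threshold
    success: int   # pieces needed for a successful upload or repair
    total: int     # erasure-coded pieces generated

    def expansion_factor(self) -> Fraction:
        # Stored data divided by original data when all pieces are kept.
        return Fraction(self.total, self.required)

    def fraction_of_segments_on_node(self, node_count: int) -> Fraction:
        # Each segment puts `total` pieces on distinct nodes, so any single
        # node holds a piece of total/node_count of all segments
        # (assumption: uniform node selection).
        return Fraction(self.total, node_count)

    def needs_repair(self, healthy_pieces: int) -> bool:
        # Repair is triggered once healthy pieces drop to the repair threshold.
        return healthy_pieces <= self.repair

    def is_recoverable(self, healthy_pieces: int) -> bool:
        # A segment can be reconstituted while `required` pieces survive.
        return healthy_pieces >= self.required


scheme = RSScheme(required=2, repair=3, success=4, total=4)
print(scheme.expansion_factor())               # 2
print(scheme.fraction_of_segments_on_node(5))  # 4/5
print(scheme.needs_repair(3))                  # True
```

With 5 nodes this gives the 2x expansion factor and the 80% share of segments that lose a piece when a single node goes offline.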

| Test Scenario | Test Case | Description | Comments |
|---|---|---|---|
| Upload | Upload with all nodes online | Every file is uploaded to 4 nodes with a 2x expansion factor, so one node holds no pieces of a given file. | Happy path scenario |
| | Upload with one node offline | If one of five nodes fails and goes offline, 80% of the stored data will lose one erasure-coded piece. The health of these segments drops from 4 pieces to 3 pieces, which marks them for repair. (overlay.node.online-window: 4h0m0s, so for about 4 hours the node will still be selected for uploads.) | Uploads will continue uninterrupted if the client uses the new refactored upload path. This improved upload logic asks the satellite for a new node if the satellite, unaware that a node is already offline, selects it for the upload. If the client uses the old upload logic, uploads may fail if the satellite selects the offline node (20% chance). Once the satellite detects the offline node, all uploads will be successful. |
| Download | Download with one node offline | If one of five nodes fails and goes offline, 80% of the stored data will lose one erasure-coded piece. The health of these segments drops from 4 pieces to 3 pieces, which marks them for repair. (overlay.node.online-window: 4h0m0s, so for about 4 hours the node will still be selected for downloads.) | |
| Repair | Repair with 2 nodes disqualified | Disqualify 2 nodes so that repair downloads are still possible but no node is available for an upload. The repair should error out early without consuming download bandwidth; download bandwidth should be spent only when at least one node is available for an upload. | If two nodes go offline, the pieces remaining in the worst case cannot be repaired, which is de facto data loss if the offline nodes are damaged. |
| Audit | | Audits can't identify corrupted pieces with just the minimum number of pieces. Reputation should not increase. Audits should be able to identify corrupted pieces with minimum + 1 pieces. Reputation should decrease. | |
| Upgrades | Nodes restart for upgrades | No more than a single node goes offline for maintenance. Otherwise, normal operation of the network cannot be ensured. | Occasionally, nodes may need to restart due to software updates. This takes the node offline for some period of time. |
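
To make the offline and disqualification cases concrete, the following sketch enumerates every way 4 pieces can be placed on 5 nodes and classifies the segment after some nodes become unavailable. It is a simplified model that treats offline and disqualified nodes alike and assumes pieces always sit on distinct nodes; it is not Storj's repair logic, and all function names are hypothetical.

```python
from itertools import combinations

NODES = range(5)                   # the 5 NUCs, numbered 0..4
REQUIRED, REPAIR, TOTAL = 2, 3, 4  # taken from the RS scheme [2,3,4,4]


def classify(placement, unavailable):
    """Classify one segment from the nodes holding its pieces."""
    healthy = len(set(placement) - set(unavailable))
    if healthy < REQUIRED:
        return "lost"
    if healthy <= REPAIR:
        return "needs repair"
    return "healthy"


def summarize(unavailable):
    """Count segment states over every possible placement of TOTAL pieces."""
    counts = {}
    for placement in combinations(NODES, TOTAL):
        state = classify(placement, unavailable)
        counts[state] = counts.get(state, 0) + 1
    return counts


def repair_targets(placement, unavailable):
    """Available nodes that do not yet hold a piece of the segment."""
    return set(NODES) - set(placement) - set(unavailable)


print(summarize(unavailable=()))      # {'healthy': 5}
print(summarize(unavailable=(0,)))    # {'needs repair': 4, 'healthy': 1}
print(summarize(unavailable=(0, 1)))  # {'needs repair': 5}

# Segment stored on nodes 1-4, with nodes 0 and 1 unavailable: three pieces
# can still be downloaded, but no node is left to receive a repaired piece.
print(repair_targets((1, 2, 3, 4), unavailable=(0, 1)))  # set()
```

In this model, one unavailable node pushes 4 of the 5 possible placements to the repair threshold, and two unavailable nodes leave every segment downloadable but in need of repair. The last call shows the situation the repair test case targets: the segment is readable, yet there is no upload target, so the repair should stop before spending download bandwidth.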