* Don't use rpcstatus.Unknown as an indicator of dial failure; instead,
GetShare now indicates with a per-share field where a failure happened
(DialFailure, RequestFailure, NoFailure). Use that information in
Verify() to determine how to treat the source node.
* Add a test that replaces a storage node with a black hole, so that
connections there will always time out. Make sure we handle that case
correctly.
Refs: https://github.com/storj/storj/issues/5632
Change-Id: I513a53520fb48b7187d4c4d7e14e01e2cfc0a721
This change implements the ranged loop observer to replace the audit
chore that builds the audit queue.
The strategy employed by this change is to use a collector for each
segment range to build separate per-node segment reservoirs that are
then merge them during the join step.
In previous observer migrations, there were only a handful of tests so
the strategy was to duplicate them. In this package, there are dozens
of tests that utilize the chore. To reduce code churn and maintenance
burden until the chore is removed, this change introduces a helper that
runs tests under both the chore and observer, providing a pair of
functions that can be used to pause or run the queueing function.
https://github.com/storj/storj/issues/5232
Change-Id: I8bb4b4e55cf98b1aac9f26307e3a9a355cb3f506
Now that all the reverification changes have been made and the old code
is out of the way, this commit renames the new things back to the old
names. Mostly, this involves renaming "newContainment" to "containment"
or "NewContainment" to "Containment", but there are a few other renames
that have been promised and are carried out here.
Refs: https://github.com/storj/storj/issues/5230
Change-Id: I34e2b857ea338acbb8421cdac18b17f2974f233c
Now that we are doing scalable piecewise reverifications, the code for
handling the old way of doing things (containment, pending audits,
reporting, testing) can now be removed.
Refs: https://github.com/storj/storj/issues/5230
Change-Id: Ief1a75f423eff682e8f3d57804e343b3409a6631
NewContainment will replace Containment later in this commit chain, but
for now it is not yet being used.
NewContainment will allow a node to be contained for multiple pending
reverify jobs at a time. It is implemented by way of the reverify queue.
Refs: https://github.com/storj/storj/issues/5231
Change-Id: I126eda0b3dfc4710a88fe4a5f41780618ec19101
As part of the effort of splitting out the auditor workers to their own
process, we are transitioning the communication between the auditor
chore and the verification workers to a queue implemented in the
database, rather than the sequence of in-memory queues we used to use.
This logical database is safely partitionable from the rest of
satelliteDB.
Refs: https://github.com/storj/storj/issues/5251
Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.
Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
This preserves the last_ip_and_port field from node lookups through
CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later
calls to (*Verifier).GetShare() can try to use that IP and port. If a
connection to the given IP and port cannot be made, or the connection
cannot be verified and secured with the target node identity, an
attempt is made to connect to the original node address instead.
A similar change is not necessary to the other Create*OrderLimits
functions, because they already replace node addresses with the cached
IP and port as appropriate. We might want to consider making a similar
change to CreateGetRepairOrderLimits(), though.
The audit situation is unique because the ramifications are especially
powerful when we get the address wrong. Failing a single audit can have
a heavy cost to a storage node. We need to make extra effort in order
to avoid imposing that cost unfairly.
Situation 1: If an audit fails because the repair worker failed to make
a DNS query (which might well be the fault on the satellite side), and
we have last_ip_and_port information available for the target node, it
would be unfair not to try connecting to that last_ip_and_port address.
Situation 2: If a node has changed addresses recently and the operator
correctly changed its DNS entry, but we don't bother querying DNS, it
would be unfair to penalize the node for our failure to connect to it.
So the audit worker must try both last_ip_and_port _and_ the node
address as supplied by the SNO.
We elect here to try last_ip_and_port first, on the grounds that (a) it
is expected to work in the large majority of cases, and (b) there
should not be any security concerns with connecting to an out-or-date
address, and (c) avoiding DNS queries on the satellite side helps
alleviate satellite operational load.
Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200