Currently the slow db was sleeping for 1s and the timeout for audit was
1s. There's a slight chance that the timeout won't trigger on such a
small difference.
Increase the slow node sleep to 10x of the timeout.
Hopefully fixes#4268
Change-Id: Ifdab45141b3fc7c62bde11813dbc534b3255fe59
Currently TextMaxVerifyCount flakes in some tests, try increasing the
sleep time to ensure that things are slow enough to trigger the error
condition.
Also pass ctx to all the funcs so we can handle sleep better.
Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.
Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
When a node gets enough timeouts, it is supposed to be removed
from pending_audits and get an audit failure. We would give them
a failure, but we missed the removal. This change fixes it.
Change-Id: I2f7014e28d7d9b01a9d051f5bbb4f67c86c7b36b
Currently, pending audit is finding segment by segment location
(path) because we want to move audit to segmentloop and we will
have only StreamID and Position we need to add columns for those
fields. Altering existing table can cause issues while
migration and deployment. Cleaner choise is to make new table.
This change contains migration with new segment_pending_audit
table that will replace pending_audits table and adjustments
to use new table in the code.
Table pending_audits will be dropped with next release.
Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f
Until now, whenever audits were recorded we would try to delete
the node from containment just in case it exists. Since we now
want to treat segment repair downloads as audits, this would
erroneously remove nodes from containment, as repair does not go
through a Reverify step. With this changeset, (Batch)UpdateStats
will not remove nodes from containment. The Reverify method will
remove all necessary nodes from containment.
Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f
When audits are being recorded, we automatically add some SQL to remove
the node from the pending audits table in case it exists. They are
removed from pending audits even if the node was offline for the audit.
This is not the correct behavior.
Add statement to record audit results in reverify tests to ensure no
more false positives.
Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds add NewRedundancy parameter for UpdateSegmentPieces method to give ability to do that. As a part of change NewPieces are validated against NewRedundancy.
Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576
We now have the piece hashes verified for all segments on all production
satellites. We can remove the code that handles the case where piece
hashes are not verified. This would make easier the migration of
services from PointerDB to the new metabase.
For consistency, PieceHashesVerified is still set to true in PointerDB
for new segments.
Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.
RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).
Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
A year ago we made the audit service deleting expired segments.
Meanwhile, we introduced an expired deletetion sub-service in the
metainfo service which sole purpose is deleting expired segments.
Therefore, now we are removing this responsibility from the audit
service. It will continue to avoid reporting failures on expired
segments, but it would not delete them anymore.
We do this to cleanup responsibilities in advance of the metainfo
refactoring.
Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4
This preserves the last_ip_and_port field from node lookups through
CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later
calls to (*Verifier).GetShare() can try to use that IP and port. If a
connection to the given IP and port cannot be made, or the connection
cannot be verified and secured with the target node identity, an
attempt is made to connect to the original node address instead.
A similar change is not necessary to the other Create*OrderLimits
functions, because they already replace node addresses with the cached
IP and port as appropriate. We might want to consider making a similar
change to CreateGetRepairOrderLimits(), though.
The audit situation is unique because the ramifications are especially
powerful when we get the address wrong. Failing a single audit can have
a heavy cost to a storage node. We need to make extra effort in order
to avoid imposing that cost unfairly.
Situation 1: If an audit fails because the repair worker failed to make
a DNS query (which might well be the fault on the satellite side), and
we have last_ip_and_port information available for the target node, it
would be unfair not to try connecting to that last_ip_and_port address.
Situation 2: If a node has changed addresses recently and the operator
correctly changed its DNS entry, but we don't bother querying DNS, it
would be unfair to penalize the node for our failure to connect to it.
So the audit worker must try both last_ip_and_port _and_ the node
address as supplied by the SNO.
We elect here to try last_ip_and_port first, on the grounds that (a) it
is expected to work in the large majority of cases, and (b) there
should not be any security concerns with connecting to an out-or-date
address, and (c) avoiding DNS queries on the satellite side helps
alleviate satellite operational load.
Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200
* The audit worker wants to get items from the queue and process them.
* The audit chore wants to create new queues and swap them in when the
old queue has been processed.
This change adds a "Queues" struct which handles the concurrency
issues around the worker fetching a queue and the chore swapping a new
queue in. It simplifies the logic of the "Queue" struct to its bare
bones, so that it behaves like a normal queue with no need to understand
the details of swapping and worker/chore interactions.
Change-Id: Ic3689ede97a528e7590e98338cedddfa51794e1b
satellite.DB.Console().Projects().GetAll database query
can be replaced with planet.Uplinks[0].Projects[0].ID
Change-Id: I73b82b91afb2dde7b690917345b798f9d81f6831
If a segment is deleted, is modified, or expires during an audit, this
is not problematic, so we should not return errors. Functionally,
nothing changes, but our metrics around audit success rate will be
improved after this change.
Change-Id: Ic11df056b2c73894b67a55894bd4d58c00470606
* Do not swap the active audit queue with the pending audit queue until
the active audit queue is empty.
* Do not begin creating a new pending audit queue until the existing
pending audit queue has been swapped to the active queue.
Change-Id: I81db5bfa01458edb8cdbe71f5baeebdcb1b94317
This test failed due to a timeout on a download which is supposed to
succeed. The testplanet default for the value is 5 seconds, but here
it is 500 milliseconds.
It looks like this is due to the fact that later in the test we need to
wait for a slow node to timeout, so we cut the timeout shorter to reduce
test time.
This PR increases the timeout to 1 second. Still not too long to wait, but
gives us twice as much time to download, decreasing the likelihood that we
see the timeout error.
Change-Id: I504db39ab5dc4d3c505520337b258265d6da7020
Instead of providing the database from outside to testplanet create it
inside and then allow wrapping and modifying it. This is more convenient
to use.
Change-Id: I9b8f69e6e0a19ff984b4e2bfe927c9100c77bc6c
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.
Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
- Previously, checkSegmentAltered only checked for segments that were replaced
but we want to detect all changes to a segment that occurred while an audit was being conducted.
- Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified
segments were not being removed from containment mode.
- Filled in gaps in reverify testing to ensure nodes are properly removed from containment.
Change-Id: Icd96d369278987200fd28581395725438972b292
With this change RS configuration will be set on satellite. Uplink with
get RS values with BeginObject request and will use it. For backward
compatibility and to avoid super large change redundancy scheme stored
with bucket is not touched. This can be done in future.
Change-Id: Ia5f76fc10c37e2c44e4f7b8754f28eafe1f97eff
Remove direct dependency on uplink.RSConfig, this simplifies
moving the config file without introducing weird dependencies.
Change-Id: I7fd2a145401e0205d7047631df9d2810241efeec
* skip unknown errors (wip)
* add tests to make sure nodes that time out are added to containment
* add bad blobs store
* call "Skipped" "Unknown"
* add tests to ensure unknown errors do not trigger containment
* add monkit stats to lockfile
* typo
* add periods to end of bad blobs comments
* during audit Verify, return error and delete segment if segment is expired
* delete "main" reverify segment and return error if expired
* delete contained nodes and pointers when pointers to audit are expired
* update testplanet.Upload and testplanet.UploadWithConfig to use an expiration time of an hour from now
* Revert "update testplanet.Upload and testplanet.UploadWithConfig to use an expiration time of an hour from now"
This reverts commit e9066151cf84afbff0929a6007e641711a56b6e5.
* do not count ExpirationDate=time.Time{} as expired
libuplink was incorrectly setting timeouts to 10 seconds still, but
should have been at least 10 minutes. the order sender was setting them
to 1 hour. we don't want timeouts in uplink-side logic as it establishes
a minimum rate on tcp streams.
instead of all of this, just use tcp keep alive. tcp keep alive packets are
sent every 15 seconds and if the peer stops responding the connection
dies. this is enabled by default with go. this will kill tcp connections
when they stop working.
Change-Id: I3d7ad49f71950b3eb43044eedf4b17993116045b
all of the packages and tests work with both grpc and
drpc. we'll probably need to do some jenkins pipelines
to run the tests with drpc as well.
most of the changes are really due to a bit of cleanup
of the pkg/transport.Client api into an rpc.Dialer in
the spirit of a net.Dialer. now that we don't need
observers, we can pass around stateless configuration
to everything rather than stateful things that issue
observations. it also adds a DialAddressID for the
case where we don't have a pb.Node, but we do have an
address and want to assert some ID. this happened
pretty frequently, and now there's no more weird
contortions creating custom tls options, etc.
a lot of the other changes are being consistent/using
the abstractions in the rpc package to do rpc style
things like finding peer information, or checking
status codes.
Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412
* add test to make sure we will reverify the share in the containment db rather than in the pointer passed into reverify
* use pending audit information only when running reverify
* rename pkg/linksharing to linksharing
* rename pkg/httpserver to linksharing/httpserver
* rename pkg/eestream to uplink/eestream
* rename pkg/stream to uplink/stream
* rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo
* rename pkg/auth/signing to pkg/signing
* rename pkg/storage to uplink/storage
* rename pkg/accounting to satellite/accounting
* rename pkg/audit to satellite/audit
* rename pkg/certdb to satellite/certdb
* rename pkg/discovery to satellite/discovery
* rename pkg/overlay to satellite/overlay
* rename pkg/datarepair to satellite/repair