The easiest way to get node information WITH node tags is executing two queries:
1. select all nodes
2. select all tags
And we can pair them with a loop, using the in-memory data structures.
But this approach does work only, if we select all nodes, which is true when we use cache (upload, download, repair checker).
But repair process selects only the required nodes, where this approach is suboptimal. (full table scan for all tags, even if we need only tags for a few dozens nodes).
Possible solutions:
1. We can introduce a cache for repair (similar to upload cache)
2. Or we can select both node and tag information with one query (join).
This patch implements the second approach.
Note: repair itself is quite slow (10-20 seconds per segements to repair). With 15 seconds execution time and 3 minutes cache staleness, we would use the cache only 12 times per worker. Probably we don't need cache for now.
https://github.com/storj/storj/issues/6198
Change-Id: I0364d94306e9815a1c280b71e843b8f504e3d870
as GetParticipatingNodes and GetNodes, respectively.
We now want these functions to include offline and suspended nodes as
well, so that we can force immediate repair when pieces are out of
placement or in excluded countries. With that change, the old names no
longer made sense.
Change-Id: Icbcbad43dbde0ca8cbc80a4d17a896bb89b078b7
This patch fixes the node tag based placement of rangedloop/repairchecker + repair process.
The main change is just adding the node tags for Reliable and KnownReliabel database calls + adding new tests to prove, it works.
https://github.com/storj/storj/issues/6126
Change-Id: I245d654a18c1d61b2c72df49afa0718d0de76da1
In the repair subsystem, it is necessary to acquire several extra
properties of nodes that are holding pieces of things or may be
selected to hold pieces. We need to know if a node is 'online' (the
definition of "online" may change somewhat depending on the situation),
if a node is in the process of graceful exit, and whether a node is
suspended. We can't just filter out nodes with all of these properties,
because sometimes we need to know properties about nodes even when the
nodes are suspended or gracefully exiting.
I thought the best way to do this was to add fields to SelectedNode,
and (to avoid any confusion) arrange for the added fields to be
populated wherever SelectedNode is returned, whether or not the new
fields are necessarily going to be used.
If people would rather I use a separate type from SelectedNode, I can do
that instead.
Change-Id: I7804a0e0a15cfe34c8ff47a227175ea5862a4ebc
Currently, we have issue were while counting unhealthy pieces we are
counting twice piece which is in excluded country and is outside segment
placement. This can cause unnecessary repair.
This change is also doing another step to move RepairExcludedCountryCodes
from overlay config into repair package.
Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
All the files in uploadselection are (in fact) related to generic node selection, and used not only for upload,
but for download, repair, etc...
Change-Id: Ie4098318a6f8f0bbf672d432761e87047d3762ab
Currently we are using Reliable to get missing pieces for repair
checker. The issue is that now checker is looking at more things than
just missing pieces (clumped/off, placement pieces) and using only node
ID is not enough. We have issue where we are skipping offline nodes from
clumped and off placement pieces check.
Reliable was refactored to get data (e.g. country, lastNet) about all
reliable nodes. List is split into online and offline. This data will be
cached for quick use by repair checker. It will be also possible to
check nodes metadata like country code or lastNet.
We are also slowly moving `RepairExcludedCountryCodes` config from
overlay to repair which makes more sens for it.
This this first part of changes.
https://github.com/storj/storj/issues/5998
Change-Id: If534342488c0e440affc2894a8fbda6507b8959d
We use two different Node types in `overlay` and `uploadnodeselection` and converting back and forth.
Using the same object would allow us to use a unified node selection interface everywhere.
Change-Id: Ie71e29d60184ee0e5b4547eb54325f09c418f73c
Currently we are using KnownUnreliableOrOffline to get missing pieces
for segment repairer (GetMissingPieces). The issue is that now repairer
is looking at more things than just missing pieces (clumped/off
placement pieces).
KnownReliable was refactored to get data (e.g. country, lastNet) about
all reliable nodes from provided list. List is split into online and
offline. This way we will be able to use results from this method to all
checks: missing pieces, clumped pieces, out of placement pieces.
This this first part of changes to handle different kind of pieces in
segment repairer.
https://github.com/storj/storj/issues/5998
Change-Id: I6cbaf59cff9d6c4346ace75bb814ccd985c0e43e
Methods SelectAllStorageNodesUpload and SelectAllStorageNodesDownload
are not returning full info with overlay.SelectedNode because its
missing CountryCode.
Change-Id: Ie3cb396bf28d7ec4c6ab8927e5bb560236036aa6
We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.
This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).
Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
This might be pretty awful, but at least it is a complete and non-flaky
solution.
**Only when using the rollingupgrade test** (which implies a throwaway
satellite and also a PostgreSQL backend), create a trigger on the nodes
table which forces last_net to be equal to last_ip_port always.
Change-Id: I8448cf131e46576d96a414d06780270c7b2b1892
This test involves a satellite with dev defaults (DistinctIP=no) being
upgraded past commit 2522ff09b6, which
means we need to run the dev-defaults-satellite-upgrade migration SQL
to avoid getting DistinctIP=yes behavior (which breaks the tests).
Change-Id: I29fb596d1ffa568dad635d98cfe9abacd3aaa48f
We will be needing an infrequent chore to check which nodes are in the
reverify queue and synchronize that set with the 'contained' field in
the nodes db, since it is easily possible for them to get out of sync.
(We can't require that the reverification queue table be in the same
database as the nodes table, so maintaining consistency with SQL
transactions is out. Plus, even if they were in the same database, using
such SQL transactions to maintain consistency would be slow and
unwieldy.)
This commit adds a method to the overlay allowing the caller to set the
contained status of all nodes in the nodes table at once. This is valid
because our definition of "contained" now depends solely on whether a
node appears at least once in the reverification queue. Only rows whose
contained field does not match the expectation will be updated; the
contained timestamp will not be updated for a node which is supposed to
be contained and was already contained.
Change-Id: I8cabe56ad897b6027e11aa5b17175295391aa3ac
we have two more fields in the database (noise_proto and
noise_public_key) that now need to go into pb.NodeAddress when
returning AddressedOrderLimits.
the only real complication is making sure type conversions between
database types and NodeURLs and so on don't lose this new
pb.NodeAddress field (NoiseInfo). otherwise this is a relatively
straightforward commit
Change-Id: I45b59d7b2d3ae21c2e6eb95497f07cd388d454b3
The CASE expression used to determine which value to set
last_software_update_email to did not have an ELSE clause. Therefore,
when the node is both below the minimum version and did not receive a
version update email (no condition is true), the value would be set to
NULL.
Additionally, replace `time.Now()` with `timestamp` in the check to
determine if the email cooldown has passed.
Change-Id: I2e2e93f1a865e123ed8b665be9621cebfb72236f
`overlay.(*Service).UpdateReputation()` takes a "reputationChanges"
parameter, a slice of node events indicating whether we think the node's
disqualification or suspension status is changing. This is necessary so
that the overlay service can notify the nodeevents DB about these
changes.
In several cases, however, this list of events is not constructed
correctly, because of missing information about the previous state.
In most cases, this is because the node was offline, and the order limit
creation functions (which usually obtain and return the prior reputation
status) ignored that node.
This change makes it so that all callers to
`overlay.(*Service).UpdateReputation()` can be expected to provide a
correct list of change events (as correct as feasible, given that we
can't lock the node's information in the database during the entire
operation).
It ended up that there was only one caller we needed to worry about, and
that was reputation.(*Service).ApplyAudit(). So the bulk of this change
is teaching that function how to recognize when the prior reputation
status was not filled in, and fill it in.
Refs: https://github.com/storj/storj/issues/5464
Change-Id: I52ce385fc9c0ce3b283b998d517998e7f4ec8792
SetNodeContained() will change the contained flag in the nodes table,
which will affect whether nodes are selected for new uploads. This flag
_should_ correlate with whether or not a given node has any entries in
the reverification queue. However, the reverification queue is intended
to be 'safely partitionable' from the nodes table, so we can't enforce
that characteristic transactionally. But this is ok; there are no dire
consequences if they are out of sync.
We will be adding a chore that updates the contained flag based on the
contents of the reverification queue periodically, if something fails
to set it directly when appropriate.
Refs: https://github.com/storj/storj/issues/5231
Change-Id: I26460d8718dee63fd55d00a44568b2065fc8fe30
Add LastOfflineEmail to overlay.NodeDossier. This is the last time a
node got an offline email. Add two new overlay db methods,
GetOfflineNodesForEmail and UpdateLastOfflineEmail. Edit db method
UpdateCheckIn to nullify last_offline_email if node is up.
Change-Id: I1ee60e7d98dd1b68348a57f9a4fb77c6c9895d6d
When a node checks in and its version is below the minimum, insert
BelowMinVersion event into node events
Change-Id: I0e437ac34496778369515cbc40c15676da8b27ae
Pieces count in DB are stored as int64 and we would like to align bloom
filter processing with this type.
Change-Id: Iaec767e609a40d802077ae057520541805a7c44f
We added nodes.disqualification_reason recently, but we didn't add a
corresponding column in the reputations table (despite having a
corresponding `disqualified` column there).
Without this change, the (very useful and informative) assignments to
updateFields.DisqualificationReason in reputations.go have no effect.
Refs: https://github.com/storj/storj/issues/4601
Change-Id: I77404902ca64b56aed72f1de76b303fe82b76aab
Set disqualification reason when reputations stats are updated on DB.Update.
Added tests for DisqualifyNode and for disqualification cases which happens during Update.
Change-Id: I00130ab5d9722422805159ad2f183c205de60f7e
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.
Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection.
Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy.
With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer.
Another change will handle the logic at the repairer level.
Fixes https://github.com/storj/team-metainfo/issues/95
Change-Id: I9231b32de117a116488de055a3e94efcabb46e81
inconsistency
The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.
Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
We had a bug in the stray nodes chore where nodes who had not been seen
in several months were not being DQd. We figured out that this was
happening because we were using two queries: The first to grab
nodes where last_contact_success < some cutoff, the second to DQ them
unless last_contact_success == '0001-01-01 00:00:00+00'. The problem
is that if all of the nodes returned from the first query had
last_contact_success of '0001-01-01 00:00:00+00', we would pass them to
the second query which would not DQ them. This would result in the stray
nodes DQ loop ending since we found a number of nodes to DQ less than the
limit.
The fix: add the "WHERE last_contact_success != '0001-01-01
00:00:00+00'::timestamptz" to the selection query.
Change-Id: I4e60de90b68d8745d641b4467c2b23e0e56f7dff
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.
Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04
nodes and audit_history tables
This PR removes all code reference to audit_histories table and
```
audit_reputation_alpha, audit_reputation_beta,
unknown_audit_reputation_alpha, unknown_audit_reputation_beta,
```
columns from nodes table.
It also drops audit_histories table from the db since the code
that's referencing it currently are not being used.
Change-Id: Ifcda8db36afb3a333d487ff831f2fdefc8b02a4c