stats/size/count is not used by any production code, and it's not required, as we can assert the state with other checks.
real motivation: the next commits will make the Selector of the State configurable; therefore we won't have one single Stat, as it will depend on the request parameters.
(we plan to support both network- and ID-based randomization)
Change-Id: I631828fc0046d2fef5b7a674fc0268a0446e9655
Currently, we have an issue where, while counting unhealthy pieces, we
count a piece twice when it is in an excluded country and outside the
segment placement. This can cause unnecessary repair.
This change also takes another step toward moving
RepairExcludedCountryCodes from the overlay config into the repair
package.
Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
placement.AllowedCountry is the old way to specify placement; with the new approach we can use a more generic (dynamic) method, which can check full node information instead of just the country code.
90% of this patch is just search and replace:
* we need to use NodeFilters instead of placement.AllowedCountry
* which means we need an initialized PlacementRules available everywhere
* which means we need to configure the placement rules
The remaining 10% is placement.go, where we introduced a new type of configuration (a lightweight expression language) to define any kind of placement without code changes.
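To illustrate the direction only (the types and names below are simplified assumptions, not the exact production API): a node filter is just a predicate over the full node record, and an allowed-country check becomes one implementation among many.

    package main

    import "fmt"

    // SelectedNode is a simplified stand-in for the full node record.
    type SelectedNode struct {
        ID          string
        CountryCode string
        LastNet     string
    }

    // NodeFilter is a predicate over the whole node, not just its country.
    type NodeFilter interface {
        Match(node *SelectedNode) bool
    }

    // CountryFilter reproduces the old placement.AllowedCountry behavior.
    type CountryFilter struct{ allowed map[string]bool }

    func (f CountryFilter) Match(node *SelectedNode) bool {
        return f.allowed[node.CountryCode]
    }

    // NodeFilters combines filters; a node must match all of them.
    type NodeFilters []NodeFilter

    func (fs NodeFilters) Match(node *SelectedNode) bool {
        for _, f := range fs {
            if !f.Match(node) {
                return false
            }
        }
        return true
    }

    func main() {
        eu := NodeFilters{CountryFilter{allowed: map[string]bool{"DE": true, "FR": true}}}
        fmt.Println(eu.Match(&SelectedNode{ID: "n1", CountryCode: "DE"})) // true
    }

The expression-based configuration can then be parsed into chains of such filters, which is what allows new placements to be defined without code changes.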
Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df
All the files in uploadselection are (in fact) related to generic node selection, and are used not only for upload,
but also for download, repair, etc.
Change-Id: Ie4098318a6f8f0bbf672d432761e87047d3762ab
ReliabilityCache will now use the refactored overlay Reliable method.
This method will provide more info about nodes (e.g. country code) and
with this we are able to add two dedicated methods to classify pieces:
* OutOfPlacementPieces
* PiecesNodesLastNetsInOrder
With those new methods we will fix the issue where an offline but
reliable node wasn't checked for clumped pieces and out-of-placement
pieces.
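Roughly the shape of those helpers, with simplified stand-in types (not the exact signatures in the checker):

    package main

    import "fmt"

    // Piece and nodeInfo are simplified stand-ins for the real types.
    type Piece struct {
        Number int
        NodeID string
    }

    type nodeInfo struct {
        CountryCode string
        LastNet     string
    }

    type reliabilityCache struct {
        nodes map[string]nodeInfo // cached per-node metadata from Reliable
    }

    // outOfPlacementPieces returns pieces whose node does not satisfy the
    // placement (here reduced to an allowed-country check).
    func (c *reliabilityCache) outOfPlacementPieces(pieces []Piece, allowed map[string]bool) []Piece {
        var out []Piece
        for _, p := range pieces {
            if info, ok := c.nodes[p.NodeID]; ok && !allowed[info.CountryCode] {
                out = append(out, p)
            }
        }
        return out
    }

    // piecesNodesLastNetsInOrder returns the cached last_net per piece, in
    // piece order, so the caller can detect clumped pieces on one network.
    func (c *reliabilityCache) piecesNodesLastNetsInOrder(pieces []Piece) []string {
        nets := make([]string, len(pieces))
        for i, p := range pieces {
            nets[i] = c.nodes[p.NodeID].LastNet
        }
        return nets
    }

    func main() {
        cache := &reliabilityCache{nodes: map[string]nodeInfo{
            "a": {CountryCode: "DE", LastNet: "10.0.0"},
            "b": {CountryCode: "RU", LastNet: "10.0.0"},
        }}
        pieces := []Piece{{Number: 0, NodeID: "a"}, {Number: 1, NodeID: "b"}}
        fmt.Println(cache.outOfPlacementPieces(pieces, map[string]bool{"DE": true}))
        fmt.Println(cache.piecesNodesLastNetsInOrder(pieces))
    }

Because the cache now holds metadata for offline-but-reliable nodes as well, those nodes are no longer skipped by these checks.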
https://github.com/storj/storj/issues/5998
Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82
Currently we are using Reliable to get missing pieces for the repair
checker. The issue is that the checker now looks at more things than
just missing pieces (clumped and out-of-placement pieces), and using
only node IDs is not enough. We have an issue where we are skipping
offline nodes in the clumped and out-of-placement pieces checks.
Reliable was refactored to get data (e.g. country, lastNet) about all
reliable nodes. The list is split into online and offline. This data
will be cached for quick use by the repair checker. It will also be
possible to check node metadata like country code or lastNet.
We are also slowly moving the `RepairExcludedCountryCodes` config from
overlay to repair, where it makes more sense.
This is the first part of the changes.
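The new shape is roughly the following (signatures simplified and assumed, not copied from the code):

    package main

    import (
        "fmt"
        "time"
    )

    // SelectedNode is a simplified stand-in for the overlay node record.
    type SelectedNode struct {
        ID          string
        LastNet     string
        CountryCode string
        LastContact time.Time
    }

    // reliable splits all reliable (not disqualified, not exited) nodes into
    // online and offline sets, keeping metadata such as country and last_net
    // so the checker can run clumped/placement checks on both sets.
    func reliable(all []SelectedNode, onlineWindow time.Duration, now time.Time) (online, offline []SelectedNode) {
        for _, node := range all {
            if now.Sub(node.LastContact) <= onlineWindow {
                online = append(online, node)
            } else {
                offline = append(offline, node)
            }
        }
        return online, offline
    }

    func main() {
        now := time.Now()
        nodes := []SelectedNode{
            {ID: "a", LastContact: now.Add(-time.Minute)},
            {ID: "b", LastContact: now.Add(-5 * time.Hour)},
        }
        online, offline := reliable(nodes, 4*time.Hour, now)
        fmt.Println(len(online), len(offline)) // 1 1
    }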
https://github.com/storj/storj/issues/5998
Change-Id: If534342488c0e440affc2894a8fbda6507b8959d
We use two different Node types in `overlay` and `uploadnodeselection` and convert back and forth between them.
Using the same object would allow us to use a unified node selection interface everywhere.
Change-Id: Ie71e29d60184ee0e5b4547eb54325f09c418f73c
Currently we are using KnownUnreliableOrOffline to get missing pieces
for the segment repairer (GetMissingPieces). The issue is that the
repairer now looks at more things than just missing pieces (clumped and
out-of-placement pieces).
KnownReliable was refactored to get data (e.g. country, lastNet) about
all reliable nodes from the provided list. The list is split into
online and offline. This way we will be able to use the results from
this method for all checks: missing pieces, clumped pieces, and
out-of-placement pieces.
This is the first part of the changes to handle different kinds of
pieces in the segment repairer.
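As a sketch of how the split could feed the repairer's checks (the rules here are simplified assumptions; only the mechanics are illustrated):

    package main

    import "fmt"

    // Piece is a simplified stand-in for a segment piece.
    type Piece struct {
        Number int
        NodeID string
    }

    // classify: pieces on nodes that are not online count as missing, while
    // pieces on any known reliable node (online or offline) still have
    // metadata available for the clumped and out-of-placement checks.
    func classify(pieces []Piece, online, offline map[string]bool) (missing, checkable []Piece) {
        for _, p := range pieces {
            if !online[p.NodeID] {
                missing = append(missing, p)
            }
            if online[p.NodeID] || offline[p.NodeID] {
                checkable = append(checkable, p)
            }
        }
        return missing, checkable
    }

    func main() {
        online := map[string]bool{"a": true}
        offline := map[string]bool{"b": true}
        pieces := []Piece{{0, "a"}, {1, "b"}, {2, "c"}}
        missing, checkable := classify(pieces, online, offline)
        fmt.Println(len(missing), len(checkable)) // 2 2
    }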
https://github.com/storj/storj/issues/5998
Change-Id: I6cbaf59cff9d6c4346ace75bb814ccd985c0e43e
We would like to verify whether nodes match a specific placement, e.g.
to validate that segment pieces are correctly geofenced.
https://github.com/storj/storj/issues/5896
Change-Id: I842767dccc121a3c60224f677ab55e5dc150c76e
It was surprising that `satellite auditor` complained about SMTP mail settings, even though it's not supposed to send any mail.
Looks like we can remove the mail service dependency, as it's not a hard requirement for overlay.Service.
Change-Id: I29a52eeff3f967ddb2d74a09458dc0ee2f051bd7
Up to now, we have been implementing the DistinctIP preference with code
in two places:
1. On check-in, the last_net is determined by taking the /24 or /64
(in ResolveIPAndNetwork()) and we store it with the node record.
2. On node selection, a preference parameter defines whether to return
results that are distinct on last_net.
It can be observed that we have never yet had the need to switch from
DistinctIP to !DistinctIP, or from !DistinctIP to DistinctIP, on the
same satellite, and we will probably never need to do so in an automated
way. It can also be observed that this arrangement makes tests more
complicated, because we often have to arrange for test nodes to have IP
addresses in different /24 networks (a particular pain on macOS).
Those two considerations, plus some pending work on the repair framework
that will make repair take last_net into consideration, motivate this
change.
With this change, in the #2 place, we will _always_ return results that
are distinct on last_net. We implement the DistinctIP preference, then,
by making the #1 place (ResolveIPAndNetwork()) more flexible. When
DistinctIP is enabled, last_net will be calculated as it was before. But
when DistinctIP is _off_, last_net can be the same as address (IP and
port). That will effectively implement !DistinctIP because every
record will have a distinct last_net already.
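A hedged sketch of that rule (simplified, not the actual ResolveIPAndNetwork() code):

    package main

    import (
        "fmt"
        "net"
    )

    // lastNet: with DistinctIP enabled the address collapses to its /24 (IPv4)
    // or /64 (IPv6) network; with DistinctIP off the full ip:port is kept, so
    // every record already has a distinct last_net.
    func lastNet(ipPort string, distinctIP bool) (string, error) {
        if !distinctIP {
            return ipPort, nil
        }
        host, _, err := net.SplitHostPort(ipPort)
        if err != nil {
            return "", err
        }
        ip := net.ParseIP(host)
        if ip == nil {
            return "", fmt.Errorf("not an IP address: %q", host)
        }
        if ip4 := ip.To4(); ip4 != nil {
            return ip4.Mask(net.CIDRMask(24, 32)).String(), nil
        }
        return ip.Mask(net.CIDRMask(64, 128)).String(), nil
    }

    func main() {
        fmt.Println(lastNet("203.0.113.7:28967", true))  // 203.0.113.0 <nil>
        fmt.Println(lastNet("203.0.113.7:28967", false)) // 203.0.113.7:28967 <nil>
    }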
As a side effect, this flexibility will allow us to change the rules
about last_net construction arbitrarily. We can do tests where last_net
is set to the source IP, or to a /30 prefix, or a /16 prefix, etc., and
be able to exercise the production logic without requiring a virtual
network bridge.
This change should be safe to make without any migration code, because
all known production satellite deployments use DistinctIP, and the
associated last_net values will not change for them. They will only
change for satellites with !DistinctIP, which are mostly test
deployments that can be recreated trivially. For those satellites which
are both permanent and !DistinctIP, node selection will suddenly start
acting as though DistinctIP is enabled, until the operator runs a single
SQL update "UPDATE nodes SET last_net = last_ip_port". That can be done
either before or after deploying software with this change.
I also assert that this will not hurt performance for production
deployments. It's true that adding the distinct requirement to node
selection makes things a little slower, but the distinct requirement is
already present for all production deployments, and they will see no
change.
Refs: https://github.com/storj/storj/issues/5391
Change-Id: I0e7e92498c3da768df5b4d5fb213dcd2d4862924
we have two more fields in the database (noise_proto and
noise_public_key) that now need to go into pb.NodeAddress when
returning AddressedOrderLimits.
the only real complication is making sure type conversions between
database types and NodeURLs and so on don't lose this new
pb.NodeAddress field (NoiseInfo). otherwise this is a relatively
straightforward commit
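the shape of the concern, with simplified stand-in types (the real definitions live in storj.io/common/pb and may differ):

    package main

    import "fmt"

    // simplified stand-ins for the protobuf types; field names are assumptions.
    type NoiseInfo struct {
        Proto     int
        PublicKey []byte
    }

    type NodeAddress struct {
        Address string
        Noise   *NoiseInfo
    }

    // nodeRecord mimics what the satellite database stores per node.
    type nodeRecord struct {
        Address        string
        NoiseProto     int
        NoisePublicKey []byte
    }

    // toNodeAddress is the kind of conversion that must carry the noise fields
    // along when building addressed order limits, instead of dropping them.
    func toNodeAddress(rec nodeRecord) NodeAddress {
        addr := NodeAddress{Address: rec.Address}
        if len(rec.NoisePublicKey) > 0 {
            addr.Noise = &NoiseInfo{Proto: rec.NoiseProto, PublicKey: rec.NoisePublicKey}
        }
        return addr
    }

    func main() {
        rec := nodeRecord{Address: "node.example:28967", NoiseProto: 1, NoisePublicKey: []byte{0x01}}
        fmt.Printf("%+v\n", toNodeAddress(rec))
    }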
Change-Id: I45b59d7b2d3ae21c2e6eb95497f07cd388d454b3
The CASE expression used to determine which value to set
last_software_update_email to did not have an ELSE clause. Therefore,
when the node is below the minimum version but did not receive a
version update email (so no condition is true), the value would be set
to NULL.
Additionally, replace `time.Now()` with `timestamp` in the check to
determine if the email cooldown has passed.
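For illustration only (the column is the one described above, but the query text and conditions here are simplified assumptions), the general shape of the fix is adding an ELSE branch that keeps the current value:

    package main

    import "fmt"

    func main() {
        // Without the ELSE branch, a row matching none of the WHEN conditions
        // would get last_software_update_email = NULL; the ELSE keeps the
        // previously stored value instead.
        query := `
            UPDATE nodes SET
                last_software_update_email = CASE
                    WHEN $1 = true THEN $2          -- version email sent now: record the timestamp
                    ELSE last_software_update_email -- otherwise keep the existing value
                END
            WHERE id = $3`
        fmt.Println(query)
    }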
Change-Id: I2e2e93f1a865e123ed8b665be9621cebfb72236f
`overlay.(*Service).UpdateReputation()` takes a "reputationChanges"
parameter, a slice of node events indicating whether we think the node's
disqualification or suspension status is changing. This is necessary so
that the overlay service can notify the nodeevents DB about these
changes.
In several cases, however, this list of events is not constructed
correctly, because of missing information about the previous state.
In most cases, this is because the node was offline, and the order limit
creation functions (which usually obtain and return the prior reputation
status) ignored that node.
This change makes it so that all callers to
`overlay.(*Service).UpdateReputation()` can be expected to provide a
correct list of change events (as correct as feasible, given that we
can't lock the node's information in the database during the entire
operation).
It ended up that there was only one caller we needed to worry about, and
that was reputation.(*Service).ApplyAudit(). So the bulk of this change
is teaching that function how to recognize when the prior reputation
status was not filled in, and fill it in.
Refs: https://github.com/storj/storj/issues/5464
Change-Id: I52ce385fc9c0ce3b283b998d517998e7f4ec8792
Add LastOfflineEmail to overlay.NodeDossier. This is the last time a
node got an offline email. Add two new overlay db methods,
GetOfflineNodesForEmail and UpdateLastOfflineEmail. Edit db method
UpdateCheckIn to nullify last_offline_email if node is up.
Change-Id: I1ee60e7d98dd1b68348a57f9a4fb77c6c9895d6d
When a node checks in and its version is below the minimum, insert
BelowMinVersion event into node events
Change-Id: I0e437ac34496778369515cbc40c15676da8b27ae
Instead of sending emails at the time the node is seen to be back
online, we have decided to send the event to the node events table,
which will initiate the email sending process at some point.
Change-Id: Id756209498112579de8e78ee20ad2df54571a617
Add nodeevents.DB to satellite overlay service so we can insert node
events into the nodeevents DB.
Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156
We want to send emails to SNOs. Node status changes go through the
overlay service, so it's a good place to add the mail service.
Add the mailservice.Service, satellite address, and satellite name to
overlay service. Also add feature flag --overlay.send-node-emails
Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24
Set disqualification reason when reputation stats are updated on DB.Update.
Added tests for DisqualifyNode and for disqualification cases which happen during Update.
Change-Id: I00130ab5d9722422805159ad2f183c205de60f7e
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.
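In other words (illustrative arithmetic only, names assumed):

    package main

    import "fmt"

    // repairTarget pads the number of pieces to repair by the count of pieces
    // currently held by nodes in excluded countries, so losing those nodes
    // would not immediately drop the segment below its optimal share count.
    func repairTarget(optimalShares, healthyShares, piecesInExcludedCountries int) int {
        target := optimalShares - healthyShares + piecesInExcludedCountries
        if target < 0 {
            return 0
        }
        return target
    }

    func main() {
        // e.g. optimal 80 shares, 60 healthy, 5 of those in excluded countries
        fmt.Println(repairTarget(80, 60, 5)) // 25
    }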
Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
Add a RepairExcludedCountryCodes config flag to overlay for providing a list of country codes whose nodes should be excluded from target repair selection.
Mark segments with fewer than repairThreshold pieces in countries not in RepairExcludedCountryCodes as unhealthy.
With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer.
Another change will handle the logic at the repairer level.
Fixes https://github.com/storj/team-metainfo/issues/95
Change-Id: I9231b32de117a116488de055a3e94efcabb46e81
The database table got invalid input and the resulting error
was not checked. This adds updates that contain invalid fields
to trigger different errors.
Change-Id: Iacea32cbef5599aab562c88e4113073596cc9996
inconsistency
The original design had a flaw which can potentially cause a
discrepancy in a node's reputation status between the reputations table
and the nodes table.
If a failure (network issue, db failure, satellite failure, etc.)
happens between the update to the reputations table and the update to
the nodes table, the data can be out of sync.
This PR tries to fix the above issue by passing the node's reputation
from the beginning of an audit/repair (this data is from the nodes
table) through to the next update in the reputation service. If the
updated reputation status from the service is different from the
existing node status, the service will try to update the nodes table.
In the case of a failure, the service will be able to try to update the
nodes table again, since it can see the discrepancy in the data. This
will allow both tables to eventually be in sync.
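A condensed sketch of the reconciliation idea (types and names below are simplified assumptions):

    package main

    import (
        "fmt"
        "time"
    )

    // ReputationStatus is a simplified stand-in for the reputation fields that
    // are mirrored into the nodes table.
    type ReputationStatus struct {
        Disqualified     *time.Time
        OfflineSuspended *time.Time
    }

    func statusEqual(a, b ReputationStatus) bool {
        same := func(x, y *time.Time) bool { return (x == nil) == (y == nil) }
        return same(a.Disqualified, b.Disqualified) &&
            same(a.OfflineSuspended, b.OfflineSuspended)
    }

    // applyAudit: `existing` is the status read from the nodes table at the
    // start of the audit/repair, `updated` is what the reputations table says
    // afterwards; the nodes table is only written when the two disagree. If
    // that write fails, the next audit still sees the discrepancy and retries,
    // so the tables converge eventually.
    func applyAudit(existing, updated ReputationStatus, updateNodesTable func(ReputationStatus) error) error {
        if statusEqual(existing, updated) {
            return nil
        }
        return updateNodesTable(updated)
    }

    func main() {
        now := time.Now()
        err := applyAudit(
            ReputationStatus{},                       // nodes table: not suspended
            ReputationStatus{OfflineSuspended: &now}, // reputations table: suspended
            func(ReputationStatus) error { fmt.Println("updating nodes table"); return nil },
        )
        fmt.Println(err)
    }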
Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.
Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04
gracefulexit
As part of the effort to move audit-related data into the reputation
store, this PR updates the gracefulexit endpoint to use the reputation
service to get a node's audit score.
Change-Id: Iad93ea689ad67ff9c57c7be16687e21e715fab7a
package in audit
This PR implements a reputation store and replaces overlay in the
audit service with that store for storing a node's audit stats.
In order to keep the changeset smaller, most of the changes in this PR
just copy the audit logic from overlay into the reputation package. In
a following PR, the duplicated code will be removed from overlay.
Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
Define service and DB interface for storing node reputation data
and updating the overlay cache.
Add overlay service and DB method UpdateReputation.
See https://github.com/storj/storj/pull/4144
Change-Id: Iedd8bd3274457d26c595919303d55327c1464b8c
GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods
which were only used by the previous downtime tracking system, which has been removed.
These methods should also be removed.
Change-Id: Idb829d742e1f987e095604423fff656fe581183e
Nodes which are offline_suspended will no longer be considered for new
uploads. The current threshold that enters a node into offline
suspension is 0.6. Disqualification for offline suspension is still
disabled.
Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a
Query the nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads have retry logic, so it isn't necessary for the nodes returned to be current.
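For illustration (the query and the placement of the clause are simplified, not the exact production statement):

    package main

    import "fmt"

    // asOfSystemTime appends the CockroachDB-only clause; on PostgreSQL the
    // query is left untouched. The interval mirrors the '-10s' default.
    func asOfSystemTime(query string, isCockroach bool, interval string) string {
        if !isCockroach || interval == "" {
            return query
        }
        return query + " AS OF SYSTEM TIME '" + interval + "'"
    }

    func main() {
        base := "SELECT id, address, last_net FROM nodes"
        fmt.Println(asOfSystemTime(base, true, "-10s"))
        // SELECT id, address, last_net FROM nodes AS OF SYSTEM TIME '-10s'
    }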