We want to send emails to SNOs. Node status changes go through the
overlay service, so it's a good place to add the mail service.
Add the mailservice.Service, satellite address, and satellite name to
overlay service. Also add feature flag --overlay.send-node-emails
Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24
Pieces count in DB are stored as int64 and we would like to align bloom
filter processing with this type.
Change-Id: Iaec767e609a40d802077ae057520541805a7c44f
Currently, the satellite tracks connectivity information about all nodes
that have contacted it, even if we have never successfully contacted
the node back.
This behavior was leveraged during a security audit to create hundreds
of thousands of "junk nodes" in the nodes table on one satellite, which
affected performance of queries such as node selection.
With this change, we should no longer track information about nodes that
have never been successfully contacted.
Note that it will still be possible to cause the creation of "junk
node" entries in the db; the attacker just has to set up individual
publicly-routable IP+port pairs for each node as it is created, so it
can respond to a PingBack.
Change-Id: Ibb6da6cc908fd4fc85aae1ba00313ba2738409ab
This is in response to community feedback that our existing reputation
calculation is too likely to disqualify storage nodes unfairly with
extreme swings up and down.
For details and analysis, please see the data_loss_vs_dq_chance_sim.py
tool, the "tuning reputation further.ipynb" Jupyter notebook in the
storj/datascience repository, and the discussion at
https://forum.storj.io/t/tuning-audit-scoring/14084
In brief: changing the lambda and initial-alpha parameters in this way
causes the swings in reputation to be smaller and less likely to put a
node past the disqualification threshold unfairly.
Note: this change will cause a one-time reset of all (non-disqualified)
node reputations, because the new initial alpha value of 1000 is
dramatically different, and the disqualification threshold is going to
be much higher.
Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3
Use DownloadSelectionCache to avoid querying database for every
download.
This change only addresses downloads from users. The download selection
cache is not currently used for audit and repair.
Change-Id: I96a49e121dac0b4204f97592a63131edabd73fb5
This test had an effective config.Reputation.AuditCount = 0, meaning all
nodes that had _any_ positive audit results were considered vetted.
Because of that, only one node in the test setup was "new". And that
node was marked as being in GE, so could not be returned by node
selection.
The reason the tests still worked is because of the node selection rule
that says "if there are no new nodes at all, just get all reputable
nodes to satisfy the request".
This commit makes it so half of the nodes are vetted and half new, which
makes the test somewhat more interesting (and means we aren't
concentrating too much on testing details of behavior when AuditCount is
0).
Change-Id: I09157b7dc20ecaddd2a6e60cfe146e9186e3603b
prevent network enumeration by rejecting privateIPs in PingMe and
Checkin endpoints
Closesstorj/storj-private#32
Change-Id: I63f00483ff4128ebd5fa9b7b8da826a5706748c9
Set disqualification reason when reputations stats are updated on DB.Update.
Added tests for DisqualifyNode and for disqualification cases which happens during Update.
Change-Id: I00130ab5d9722422805159ad2f183c205de60f7e
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.
Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
Create global config to specify a list of country codes that should be
excluded from node selection during uploads.
This exclusion is not implemented when the upload selection cache is
disabled.
Change-Id: Ic41e8b4f18857a11045668eac23107da99668a72
Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection.
Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy.
With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer.
Another change will handle the logic at the repairer level.
Fixes https://github.com/storj/team-metainfo/issues/95
Change-Id: I9231b32de117a116488de055a3e94efcabb46e81
The database table got invalid input and the resulting error
was not checked. This adds updates that contain invalid fields
to trigger different errors.
Change-Id: Iacea32cbef5599aab562c88e4113073596cc9996
inconsistency
The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.
Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
Currently the test threshold for unneeded node checkins are 30s.
The disqualification threshold was at 30s, which means, it was possible
for all the nodes to get disqualified.
Hopefully fixes#4267
Change-Id: I6b0a10c09b7fd90a9729794885c9e7a593781bad
We had a bug in the stray nodes chore where nodes who had not been seen
in several months were not being DQd. We figured out that this was
happening because we were using two queries: The first to grab
nodes where last_contact_success < some cutoff, the second to DQ them
unless last_contact_success == '0001-01-01 00:00:00+00'. The problem
is that if all of the nodes returned from the first query had
last_contact_success of '0001-01-01 00:00:00+00', we would pass them to
the second query which would not DQ them. This would result in the stray
nodes DQ loop ending since we found a number of nodes to DQ less than the
limit.
The fix: add the "WHERE last_contact_success != '0001-01-01
00:00:00+00'::timestamptz" to the selection query.
Change-Id: I4e60de90b68d8745d641b4467c2b23e0e56f7dff
Union query for both reputable and new nodes didn't properly work. The
top level query is required to have an `AS OF SYSTEM TIME` statement as
well.
Change-Id: I8ee6dd5b700c2b1ed2aa562962bfa72be7eec30a
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.
Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04
gracefulexit
With the effort to move audit related data into reputation store, this
PR updates gracefulexit endpoint to use reputation service to get a
node's audit score
Change-Id: Iad93ea689ad67ff9c57c7be16687e21e715fab7a
package in audit
This PR implements reputation store and replace overlay in audit service
to use such store for storing node's audit stats.
In order to keep the changeset smaller, most of the changes in this PR is for copying audit logic in overlay to
reputation package. In a following PR, the duplicating code will be
removed from overlay.
Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
Define service and DB interface for storing node reputation data
and updating the overlay cache.
Add overlay service and DB method UpdateReputation.
See https://github.com/storj/storj/pull/4144
Change-Id: Iedd8bd3274457d26c595919303d55327c1464b8c
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.
As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.
This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.
In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.
Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
Improve UpdateCheckIn on a contended row:
name old time/op new time/op delta
UpdateCheckInContended-100x-32 2.29s ±55% 0.17s ±61% -92.45% (p=0.008 n=5+5)
Change-Id: I053ab9f1cff136c306e5fb57f5e355cdc0269a8c
Currently we were duplicating code for AS OF SYSTEM TIME in several
places. This replaces the code with using a method on
dbutil.Implementation.
As a consequence it's more useful to use a shorter name for
implementation - 'impl' should be sufficiently clear in the context.
Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish
between the two modes is useful and slightly shorter without causing
confusion.
Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d
Currently nodeselection package only contained state for uploads, move
these to a subpackage, such that we can make another "downloadselection"
for downloads. Then move selection logic from overlay to nodeselection.
Change-Id: I0fc42bcae3a29db2728dae9f3863b1e95bf5165b
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
When nodes check in for the very first time, if the satellite can't ping
them back, they are inserted into the nodes table with
last_contact_success of '0001-01-01 00:00:00+00'. If the stray nodes
chore runs before the node can fix their problem, they are DQd.
Solution: when DQing stray nodes, dont DQ where last_contact_success =
'0001-01-01 00:00:00+00'::timestamptz
Change-Id: I477a02d5ef85b2c930ed6b7d99a4d1995169bca8
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
instead of only generating invoices for nodes that had some
activity, we generate it for every node so that we can find
and pay terminal nodes that did not meet thresholds before
we recognized them as terminal.
Change-Id: Ibb3433e1b35f1ddcfbe292c034238c9fa1b66c44