This method is sometimes ends with transaction error. Most probably
because it's trying to do full table scan on nodes table which is
heavily used. Adding AOST should help with DB contention.
Change-Id: Ibd4358d28dc26922b60c6b30862f20e7c0662cd1
as GetParticipatingNodes and GetNodes, respectively.
We now want these functions to include offline and suspended nodes as
well, so that we can force immediate repair when pieces are out of
placement or in excluded countries. With that change, the old names no
longer made sense.
Change-Id: Icbcbad43dbde0ca8cbc80a4d17a896bb89b078b7
Once uppon a time, at the dawn of the implementation of Storj, when all the nodes are read from the database directly, every time.
After a while -- due to performance reasons -- it has been changed for upload and download: where all the nodes are read for a short period of time, and used from memory.
This is the version which was improved recently to support advanced node selections using placement.
But stil we have an old configuration value `service.config.NodeSelectionCache.Disabled`, and the db based implementation: `service.FindStorageNodesWithPreferences(ctx, req, &service.config.Node)`.
For safety, we need to remove this option, to make sure that we use the cache, which has the advanced features.
This patch was supposed to be a very small one (just removing a method and a config: https://review.dev.storj.io/c/storj/storj/+/11074/1/satellite/overlay/service.go), but it turned out that we need to update a lot of unit tests.
These unit tests used the old implementation (which is not used in production any more).
The tests which used both implementation are just updated to use only the new one
The tests which used only the old implementation are refactored (but keeping the test cases).
Using real unit tests (without DB, working on OSX, fast)
Closes https://github.com/storj/storj/issues/6217
Change-Id: I023f92c7e34235665cf8474513e67b2fcc4763eb
Currently, we have issue were while counting unhealthy pieces we are
counting twice piece which is in excluded country and is outside segment
placement. This can cause unnecessary repair.
This change is also doing another step to move RepairExcludedCountryCodes
from overlay config into repair package.
Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
placement.AllowedCountry is the old way to specify placement, with the new approach we can use a more generic (dynamic method), which can check full node information instead of just the country code.
The 90% of this patch is just search and replace:
* we need to use NodeFilters instead of placement.AllowedCountry
* which means, we need an initialized PlacementRules available everywhere
* which means we need to configure the placement rules
The remaining 10% is the placement.go, where we introduced a new type of configuration (lightweight expression language) to define any kind of placement without code change.
Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df
All the files in uploadselection are (in fact) related to generic node selection, and used not only for upload,
but for download, repair, etc...
Change-Id: Ie4098318a6f8f0bbf672d432761e87047d3762ab
ReliabilityCache will be now using refactored overlay Reliable method.
This method will provide more info about nodes (e.g. country code) and
with this we are able to add two dedicated methods to classify pieces:
* OutOfPlacementPieces
* PiecesNodesLastNetsInOrder
With those new method we will fix issue where offline but reliable node
won't be checked for clumped pieces and off placement pieces.
https://github.com/storj/storj/issues/5998
Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82
Currently we are using Reliable to get missing pieces for repair
checker. The issue is that now checker is looking at more things than
just missing pieces (clumped/off, placement pieces) and using only node
ID is not enough. We have issue where we are skipping offline nodes from
clumped and off placement pieces check.
Reliable was refactored to get data (e.g. country, lastNet) about all
reliable nodes. List is split into online and offline. This data will be
cached for quick use by repair checker. It will be also possible to
check nodes metadata like country code or lastNet.
We are also slowly moving `RepairExcludedCountryCodes` config from
overlay to repair which makes more sens for it.
This this first part of changes.
https://github.com/storj/storj/issues/5998
Change-Id: If534342488c0e440affc2894a8fbda6507b8959d
We use two different Node types in `overlay` and `uploadnodeselection` and converting back and forth.
Using the same object would allow us to use a unified node selection interface everywhere.
Change-Id: Ie71e29d60184ee0e5b4547eb54325f09c418f73c
Currently we are using KnownUnreliableOrOffline to get missing pieces
for segment repairer (GetMissingPieces). The issue is that now repairer
is looking at more things than just missing pieces (clumped/off
placement pieces).
KnownReliable was refactored to get data (e.g. country, lastNet) about
all reliable nodes from provided list. List is split into online and
offline. This way we will be able to use results from this method to all
checks: missing pieces, clumped pieces, out of placement pieces.
This this first part of changes to handle different kind of pieces in
segment repairer.
https://github.com/storj/storj/issues/5998
Change-Id: I6cbaf59cff9d6c4346ace75bb814ccd985c0e43e
For now we will use bucket placement to determine if we should exclude
some node IPs from metainfo.GetObjectIPs results. Bucket placement is
retrieved directly from DB in parallel to metabase
GetStreamPieceCountByNodeID request.
GetObjectIPs is not heavily used so additional request to DB shouldn't
be a problem for now.
https://github.com/storj/storj/issues/5950
Change-Id: Idf58b1cfbcd1afff5f23868ba2f71ce239f42439
We would like to verify if nodes matches specific placement e.g. to
validate segment pieces are correctly geofenced.
https://github.com/storj/storj/issues/5896
Change-Id: I842767dccc121a3c60224f677ab55e5dc150c76e
Methods SelectAllStorageNodesUpload and SelectAllStorageNodesDownload
are not returning full info with overlay.SelectedNode because its
missing CountryCode.
Change-Id: Ie3cb396bf28d7ec4c6ab8927e5bb560236036aa6
We were using the UploadSelectionCache previously, which does _not_ have
all nodes, or even all online nodes, in it. So all nodes with less than
MinimumVersion, or with less than MinimumDiskSpace, or nodes suspended
for unknown audit errors, or nodes that have started graceful exit, were
all missing, and ended up having empty last_nets. Even with all that,
I'm kind of surprised how many nodes this involved, but using the upload
selection cache was definitely wrong.
This change uses the download selection cache instead, which excludes
nodes only when they are disqualified, gracefully exited (completely),
or offline.
Change-Id: Iaa07c988aa29c1eb05796ac48a6f19d69f5826c1
The query for GetNodesNetworkInOrder is causing far too much load on the
database. Since it is not critical that the repair checker have
perfectly up-to-date node network information, we can use a cache
instead.
Change-Id: I07ad45bfdeb46529da093941a06c2da8a00ce878
We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.
This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).
Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
We have a specific issue that a user uploaded a file to a bucket
geo-fenced to EU and one of the pieces appeared to be on a node in the
US. The country code of this node is set to SE (Sweden) in the satellite
DB. It turns out that some time ago MaxMind changed the country code of
this node's IP from Sweden to US, but this change hasn't been reflected
in the satellite's database.
So far the satellite updates the country code of a node only if its IP
changes. It was assumed that if the IP does not change, its country code
shouldn't change too. This turned to be a wrong assumption.
With this change, the satellite will look up the MaxMindDB on every
check-in to see if the country code of the node's IP has changed.
Change-Id: Icdf659b09be9fc6ad14601902032b35ba5ea78c4
This test involves a satellite with dev defaults (DistinctIP=no) being
upgraded past commit 2522ff09b6, which
means we need to run the dev-defaults-satellite-upgrade migration SQL
to avoid getting DistinctIP=yes behavior (which breaks the tests).
Change-Id: I29fb596d1ffa568dad635d98cfe9abacd3aaa48f
It was surprising that `satellite auditor` complained about SMTP mail settings, even if it's not supposed to sending any mail.
Looks like we can remove the mail service dependency, as it's not a hard requirement for overlay.Service.
Change-Id: I29a52eeff3f967ddb2d74a09458dc0ee2f051bd7
Up to now, we have been implementing the DistinctIP preference with code
in two places:
1. On check-in, the last_net is determined by taking the /24 or /64
(in ResolveIPAndNetwork()) and we store it with the node record.
2. On node selection, a preference parameter defines whether to return
results that are distinct on last_net.
It can be observed that we have never yet had the need to switch from
DistinctIP to !DistinctIP, or from !DistinctIP to DistinctIP, on the
same satellite, and we will probably never need to do so in an automated
way. It can also be observed that this arrangement makes tests more
complicated, because we often have to arrange for test nodes to have IP
addresses in different /24 networks (a particular pain on macOS).
Those two considerations, plus some pending work on the repair framework
that will make repair take last_net into consideration, motivate this
change.
With this change, in the #2 place, we will _always_ return results that
are distinct on last_net. We implement the DistinctIP preference, then,
by making the #1 place (ResolveIPAndNetwork()) more flexible. When
DistinctIP is enabled, last_net will be calculated as it was before. But
when DistinctIP is _off_, last_net can be the same as address (IP and
port). That will effectively implement !DistinctIP because every
record will have a distinct last_net already.
As a side effect, this flexibility will allow us to change the rules
about last_net construction arbitrarily. We can do tests where last_net
is set to the source IP, or to a /30 prefix, or a /16 prefix, etc., and
be able to exercise the production logic without requiring a virtual
network bridge.
This change should be safe to make without any migration code, because
all known production satellite deployments use DistinctIP, and the
associated last_net values will not change for them. They will only
change for satellites with !DistinctIP, which are mostly test
deployments that can be recreated trivially. For those satellites which
are both permanent and !DistinctIP, node selection will suddenly start
acting as though DistinctIP is enabled, until the operator runs a single
SQL update "UPDATE nodes SET last_net = last_ip_port". That can be done
either before or after deploying software with this change.
I also assert that this will not hurt performance for production
deployments. It's true that adding the distinct requirement to node
selection makes things a little slower, but the distinct requirement is
already present for all production deployments, and they will see no
change.
Refs: https://github.com/storj/storj/issues/5391
Change-Id: I0e7e92498c3da768df5b4d5fb213dcd2d4862924
We will be needing an infrequent chore to check which nodes are in the
reverify queue and synchronize that set with the 'contained' field in
the nodes db, since it is easily possible for them to get out of sync.
(We can't require that the reverification queue table be in the same
database as the nodes table, so maintaining consistency with SQL
transactions is out. Plus, even if they were in the same database, using
such SQL transactions to maintain consistency would be slow and
unwieldy.)
This commit adds a method to the overlay allowing the caller to set the
contained status of all nodes in the nodes table at once. This is valid
because our definition of "contained" now depends solely on whether a
node appears at least once in the reverification queue. Only rows whose
contained field does not match the expectation will be updated; the
contained timestamp will not be updated for a node which is supposed to
be contained and was already contained.
Change-Id: I8cabe56ad897b6027e11aa5b17175295391aa3ac
we have two more fields in the database (noise_proto and
noise_public_key) that now need to go into pb.NodeAddress when
returning AddressedOrderLimits.
the only real complication is making sure type conversions between
database types and NodeURLs and so on don't lose this new
pb.NodeAddress field (NoiseInfo). otherwise this is a relatively
straightforward commit
Change-Id: I45b59d7b2d3ae21c2e6eb95497f07cd388d454b3
The CASE expression used to determine which value to set
last_software_update_email to did not have an ELSE clause. Therefore,
when the node is both below the minimum version and did not receive a
version update email (no condition is true), the value would be set to
NULL.
Additionally, replace `time.Now()` with `timestamp` in the check to
determine if the email cooldown has passed.
Change-Id: I2e2e93f1a865e123ed8b665be9621cebfb72236f
SetNodeContained() will change the contained flag in the nodes table,
which will affect whether nodes are selected for new uploads. This flag
_should_ correlate with whether or not a given node has any entries in
the reverification queue. However, the reverification queue is intended
to be 'safely partitionable' from the nodes table, so we can't enforce
that characteristic transactionally. But this is ok; there are no dire
consequences if they are out of sync.
We will be adding a chore that updates the contained flag based on the
contents of the reverification queue periodically, if something fails
to set it directly when appropriate.
Refs: https://github.com/storj/storj/issues/5231
Change-Id: I26460d8718dee63fd55d00a44568b2065fc8fe30
Add a new chore to periodically insert nodes who are offline and
have not gotten an offline email in a certain amount of time into node
events
Change-Id: I658b385bb777b0240c98092946a93d65bee94abc
Add LastOfflineEmail to overlay.NodeDossier. This is the last time a
node got an offline email. Add two new overlay db methods,
GetOfflineNodesForEmail and UpdateLastOfflineEmail. Edit db method
UpdateCheckIn to nullify last_offline_email if node is up.
Change-Id: I1ee60e7d98dd1b68348a57f9a4fb77c6c9895d6d
When a node checks in and its version is below the minimum, insert
BelowMinVersion event into node events
Change-Id: I0e437ac34496778369515cbc40c15676da8b27ae
Instead of sending emails at the time the node is seen to be back
online, we have decided to send the event to the node events table,
which will initiate the email sending process at some point.
Change-Id: Id756209498112579de8e78ee20ad2df54571a617
Add nodeevents.DB to satellite overlay service so we can insert node
events into the nodeevents DB.
Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156
We want to send emails to SNOs. Node status changes go through the
overlay service, so it's a good place to add the mail service.
Add the mailservice.Service, satellite address, and satellite name to
overlay service. Also add feature flag --overlay.send-node-emails
Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24
Pieces count in DB are stored as int64 and we would like to align bloom
filter processing with this type.
Change-Id: Iaec767e609a40d802077ae057520541805a7c44f
Currently, the satellite tracks connectivity information about all nodes
that have contacted it, even if we have never successfully contacted
the node back.
This behavior was leveraged during a security audit to create hundreds
of thousands of "junk nodes" in the nodes table on one satellite, which
affected performance of queries such as node selection.
With this change, we should no longer track information about nodes that
have never been successfully contacted.
Note that it will still be possible to cause the creation of "junk
node" entries in the db; the attacker just has to set up individual
publicly-routable IP+port pairs for each node as it is created, so it
can respond to a PingBack.
Change-Id: Ibb6da6cc908fd4fc85aae1ba00313ba2738409ab