Orders from storage nodes are received by the SettlementWithWindowFinal method. A stream receives
all orders, and once they are all in we insert storage node and bucket bandwidth into the DB. The problem is with bucket bandwidth, which is stored through a cache: the cache often uses the context from the SettlementWithWindowFinal stream to perform the DB inserts, and it does so in a separate goroutine. Because of that, SettlementWithWindowFinal can finish before the flush does, and the context gets canceled while the insert into the DB is still running.
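A minimal sketch of the fix idea in Go, with simplified, hypothetical names (BandwidthRollup, flushToDB): detach the background flush from the stream's context so the insert survives the RPC returning.

```go
package settlement

import (
	"context"
	"time"
)

// BandwidthRollup is a simplified stand-in for a cached bucket bandwidth entry.
type BandwidthRollup struct {
	ProjectID  string
	BucketName string
	Settled    int64
}

// flushToDB stands in for the real batch insert.
func flushToDB(ctx context.Context, batch []BandwidthRollup) error {
	// ... INSERT INTO bucket_bandwidth_rollups ...
	return nil
}

// flushAsync runs the insert in its own goroutine. Instead of reusing the
// RPC stream's context (which is canceled as soon as SettlementWithWindowFinal
// returns), it derives a fresh context with its own timeout, so an in-flight
// insert is not interrupted by the caller finishing first.
func flushAsync(batch []BandwidthRollup) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
		_ = flushToDB(ctx, batch)
	}()
}
```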
Change-Id: I3a72c86390e9aedc060f6b082bb059f1406231ee
we have two more fields in the database (noise_proto and
noise_public_key) that now need to go into pb.NodeAddress when
returning AddressedOrderLimits.
the only real complication is making sure type conversions between
database types and NodeURLs and so on don't lose this new
pb.NodeAddress field (NoiseInfo). otherwise this is a relatively
straightforward commit
Change-Id: I45b59d7b2d3ae21c2e6eb95497f07cd388d454b3
For bucket_bandwidth_rollups we are trying to insert lots of entries
with an empty bucket name and project ID. Those are inserts from orders
created by repair, audit and GE. High load on the same primary key
(the same range) causes many retries, and that affects all inserts
because we put 1000 entries into the DB at once.
This change solves the problem by not storing actions other than GET
and PUT in bucket_bandwidth_rollups. Only those actions matter from the
bucket bandwidth usage perspective, because they are the actions
performed by users. The other actions (repair, audit or GE) are also
stored in storagenode_bandwidth_rollups, so we will still have access
to them, e.g. for statistics.
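A minimal sketch of the filtering, using a simplified stand-in for pb.PieceAction rather than the real satellite types:

```go
package ordersrollups

// Action is a simplified stand-in for pb.PieceAction.
type Action int

const (
	ActionPut Action = iota
	ActionGet
	ActionGetAudit
	ActionGetRepair
	ActionPutRepair
	ActionPutGracefulExit
)

// BucketBandwidthRollup is a simplified rollup row.
type BucketBandwidthRollup struct {
	ProjectID  string
	BucketName string
	Action     Action
	Settled    int64
}

// filterForBucketRollups keeps only user-driven GET and PUT entries before
// the batch insert into bucket_bandwidth_rollups; repair, audit and GE
// traffic is skipped here but is still recorded per storage node.
func filterForBucketRollups(in []BucketBandwidthRollup) []BucketBandwidthRollup {
	out := make([]BucketBandwidthRollup, 0, len(in))
	for _, r := range in {
		if r.Action == ActionGet || r.Action == ActionPut {
			out = append(out, r)
		}
	}
	return out
}
```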
https://github.com/storj/storj/issues/5332
Change-Id: Ibb5bf0a4c869b0439dc65da1c9342a38ca2890ba
Sorting by primary key before inserting data into the DB is fixed.
Earlier we sorted the input slice of BucketBandwidthRollup, but then we
put all entries into a map to roll up the input data. Iterating over a
map with a range loop doesn't guarantee any specific order, so we lost
the sorted order when building the insert slices from this map.
The new code still uses a map, but when the map is full it sorts the
map keys separately and iterates over them to read the data from the
map.
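A minimal sketch of the new ordering logic, with simplified key/value types:

```go
package orderscache

import "sort"

type rollupKey struct {
	ProjectID  string
	BucketName string
	Action     int
}

// sortedRollups collects and sorts the map keys, then emits the values in
// key order, so rows reach the DB sorted by primary key even though map
// iteration order is random.
func sortedRollups(m map[rollupKey]int64) (keys []rollupKey, values []int64) {
	keys = make([]rollupKey, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].ProjectID != keys[j].ProjectID {
			return keys[i].ProjectID < keys[j].ProjectID
		}
		if keys[i].BucketName != keys[j].BucketName {
			return keys[i].BucketName < keys[j].BucketName
		}
		return keys[i].Action < keys[j].Action
	})
	values = make([]int64, 0, len(keys))
	for _, k := range keys {
		values = append(values, m[k])
	}
	return keys, values
}
```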
https://github.com/storj/storj/issues/5332
Change-Id: I5bf09489b0eecb6858bf854ab387b660124bf53f
When we have a problem with inserting bandwidth amounts into the cache
or the DB we log information about it, but the log entries are not
very detailed. This change adds the bandwidth amounts to the log entry.
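A small illustrative sketch, assuming zap is the logger (as elsewhere in the satellite); the field names here are only examples:

```go
package ordersflush

import (
	"time"

	"go.uber.org/zap"
)

// logFlushFailure records the failed amounts alongside the error so the
// operator can see exactly how much bandwidth was not persisted.
func logFlushFailure(log *zap.Logger, intervalStart time.Time, allocated, settled, dead int64, err error) {
	log.Error("failed to update bucket bandwidth",
		zap.Time("intervalStart", intervalStart),
		zap.Int64("allocated", allocated),
		zap.Int64("settled", settled),
		zap.Int64("dead", dead),
		zap.Error(err),
	)
}
```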
https://github.com/storj/storj/issues/5470
Change-Id: I55ccad837d17b141501d3def1dec7ad5f3acdb0b
* use the same DB application name for satellite and metabase
* use noop orders DB implementation to avoid storing allocated bandwidth
in DB
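A minimal sketch of the noop idea, against a simplified, hypothetical interface rather than the real satellite/orders.DB: the stand-in implementation simply discards the writes, so tools that don't need bandwidth accounting generate no DB traffic.

```go
package ordersnoop

import (
	"context"
	"time"
)

// BandwidthDB is a simplified, assumed shape of the orders bandwidth DB.
type BandwidthDB interface {
	UpdateBucketBandwidthAllocation(ctx context.Context, projectID string, bucket []byte, amount int64, intervalStart time.Time) error
}

type noopDB struct{}

// UpdateBucketBandwidthAllocation intentionally drops the write.
func (noopDB) UpdateBucketBandwidthAllocation(ctx context.Context, projectID string, bucket []byte, amount int64, intervalStart time.Time) error {
	return nil
}
```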
Change-Id: I20e88c694d38240fe1a20c45719e210cfb76402c
bucket_bandwidth_rollups table
We have performance problems with updating bucket_bandwidth_rollups. To
improve the situation we can stop storing allocated bandwidth in this table.
This should reduce the large number of updates coming from
metainfo endpoints, repair workers and audit.
The next step will be to drop the `allocated` column completely from
bucket_bandwidth_rollups.
Allocated GET bandwidth is all we need, and we keep it in the
bucket_bandwidth_rollups table.
Change-Id: Ifdd26a89ba8262acbca6d794a6c02883ad0c0c9b
Currently the primary key of the underlying rollup table is the bucket
name, but we used to sort by projectID. This caused deadlocks due to
contention during updates/inserts.
We should reevaluate whether the bucket name being the primary key is
right for this table, but this should stop the long-running and failing attempts.
Change-Id: Ie7d0f86944da48ad9cbd92eb162226882a2fb954
ReverifyPiece() is not currently hooked up to anything, but is planned
to take the place of audit.(*Verifier).Reverify().
ReverifyPiece() works by downloading one piece in its entirety, rather
than pulling an entire stripe across many nodes.
Change-Id: Ie2c680f4d3c3b65273a72466a3f9f55c115b0311
Use DownloadSelectionCache to avoid querying database for every
download.
This change only addresses downloads from users. The download selection
cache is not currently used for audit and repair.
Change-Id: I96a49e121dac0b4204f97592a63131edabd73fb5
We can use PieceIDDeriver in all places where we derive IDs from the
same ID multiple times. We have several such places: GC, segment
deletion, segment validation, and order limit creation. Using it should
save some resources.
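A hypothetical sketch of the deriver pattern (the real PieceIDDeriver lives in storj.io/common/storj and differs in detail): keep one keyed HMAC around and reuse it for every derivation from the same root ID, instead of rebuilding state on each call.

```go
package deriveexample

import (
	"crypto/hmac"
	"crypto/sha512"
	"encoding/binary"
	"hash"
)

type PieceID [32]byte
type NodeID [32]byte

// Deriver reuses a single keyed HMAC across derivations from the same root.
type Deriver struct {
	mac hash.Hash
}

// Deriver captures the root piece ID once.
func (id PieceID) Deriver() *Deriver {
	return &Deriver{mac: hmac.New(sha512.New, id[:])}
}

// Derive computes a per-node piece ID from the captured root.
func (d *Deriver) Derive(node NodeID, pieceNum int32) PieceID {
	d.mac.Reset()
	d.mac.Write(node[:])
	var num [4]byte
	binary.BigEndian.PutUint32(num[:], uint32(pieceNum))
	d.mac.Write(num[:])
	var out PieceID
	copy(out[:], d.mac.Sum(nil))
	return out
}
```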
Change-Id: I24668d516c0f7cea4aec6470614067734149501d
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.
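A tiny illustrative sketch of the adjustment, with simplified inputs:

```go
package repairexample

// piecesToRepair returns how many new pieces the repairer should create:
// enough to get back to the optimal share count, plus one extra piece for
// every piece currently held by a node in an excluded area, since those
// pieces stay in the segment but may disappear.
func piecesToRepair(optimalShares, healthyShares, piecesInExcludedAreas int) int {
	needed := optimalShares - healthyShares
	if needed < 0 {
		needed = 0
	}
	return needed + piecesInExcludedAreas
}
```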
Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
inconsistency
The original design had a flaw which can potentially cause a discrepancy
in a node's reputation status between the reputations table and the nodes table.
If a failure (network issue, DB failure, satellite failure, etc.)
happens between the update to the reputations table and the update to the
nodes table, the data can be out of sync.
This PR tries to fix the above issue by passing the node's reputation (read
from the nodes table) from the beginning of an audit/repair through to the next
update in the reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update the nodes
table. In the case of a failure, the service will be able to try updating the nodes
table again, since it can see the discrepancy in the data. This allows
both tables to eventually become consistent.
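A minimal sketch of the pass-through-and-compare idea, with simplified, hypothetical types:

```go
package reputationexample

import "context"

// Status is a simplified stand-in for a node's reputation status.
type Status struct {
	Disqualified bool
	Suspended    bool
}

// NodesTable is a simplified stand-in for the nodes table accessor.
type NodesTable interface {
	UpdateStatus(ctx context.Context, nodeID string, status Status) error
}

// applyAuditResult compares the freshly computed status with the status the
// audit/repair worker read from the nodes table at the start; if they differ
// (including because an earlier update failed), the nodes table is rewritten,
// so the two tables converge eventually.
func applyAuditResult(ctx context.Context, nodes NodesTable, nodeID string, statusFromNodesTable, updatedStatus Status) error {
	if updatedStatus != statusFromNodesTable {
		return nodes.UpdateStatus(ctx, nodeID, updatedStatus)
	}
	return nil
}
```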
Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
Batching of the order submissions can lead to combining the allocated
traffic totals for two completely different time windows, resulting
in incorrect customer accounting. This change will group the batched
order submissions by projectID as well as time window, leading to
distinct updates of a bucket's bandwidth rollup based on the hour
window in which the order was created.
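A minimal sketch of the grouping key, with simplified types: the hour window comes from the order creation time, so totals from different windows can never be merged.

```go
package ordersbatch

import "time"

type order struct {
	ProjectID  string
	BucketName string
	CreatedAt  time.Time
	Allocated  int64
}

// trafficKey groups batched submissions by project, bucket, and the hour
// window in which the order was created.
type trafficKey struct {
	ProjectID  string
	BucketName string
	HourWindow time.Time // CreatedAt truncated to the hour
}

func groupAllocated(orders []order) map[trafficKey]int64 {
	totals := make(map[trafficKey]int64)
	for _, o := range orders {
		key := trafficKey{
			ProjectID:  o.ProjectID,
			BucketName: o.BucketName,
			HourWindow: o.CreatedAt.Truncate(time.Hour),
		}
		totals[key] += o.Allocated
	}
	return totals
}
```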
Change-Id: Ifb4d67923eec8a533b9758379914f17ff7abea32
Currently most of the satellite log is made up of this specific log
message, so we should lower this message's log level. We already have
a monkit event tracking this case. If we need to look into it
more, we can simply increase the satellite's log verbosity.
Change-Id: I27ebed3e6745701c66c83e8c52ddc836ad9d5f4e
When we switched to the segments loop we stopped adding the bucket name
and project ID to the order limit metadata. This is causing lots of
logging that we don't need.
Change-Id: Id3f878a170b7a6b801e8a838ee69165715985d60
Populate the egress_dead column to take into account allocated bandwidth
that can be removed because the orders have been sent by the storage nodes.
The bandwidth not used in these orders can be allocated again.
Change-Id: I78c333a03945cd7330aec052edd3562ec671118e
In the situation where the flushes take longer than the incoming
rate of writes, the RollupsWriteCache will take every connection
in the database pool and use them forever. Instead of doing that
and taking down satellite availability, bound the number of flush
operations that it will perform and drop incoming writes earlier
to keep memory usage constant.
Adds monitoring events for if any flushes or updates are lost.
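A minimal sketch of the bounding idea, with hypothetical names: a fixed-size semaphore caps concurrent flushes, and writes arriving while all slots are busy are counted and dropped.

```go
package rollupscache

import (
	"context"
	"sync/atomic"
)

// cache bounds the number of concurrent flushes with a fixed-size
// semaphore so it can never occupy the whole DB connection pool.
type cache struct {
	flushSem chan struct{} // capacity = max concurrent flushes
	dropped  int64         // would be a monkit counter in the real code
}

func newCache(maxConcurrentFlushes int) *cache {
	return &cache{flushSem: make(chan struct{}, maxConcurrentFlushes)}
}

// tryFlush starts a flush if a slot is free; otherwise the write is
// dropped (and counted) instead of queueing unboundedly in memory.
func (c *cache) tryFlush(ctx context.Context, flush func(context.Context)) bool {
	select {
	case c.flushSem <- struct{}{}:
		go func() {
			defer func() { <-c.flushSem }()
			flush(ctx)
		}()
		return true
	default:
		atomic.AddInt64(&c.dropped, 1)
		return false
	}
}
```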
Change-Id: I81b169b73501ee9b999f4b03d1e79645fc56f167
At some point we moved the metabase package outside Metainfo,
but we didn't do that for the satellite structure. This change
refactors only tests.
When uplink is adjusted we can remove the old entries in the
Metainfo struct.
Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
In the past ExpirationDate was available inside CreatePutRepairOrderLimits
but this was removed since the metabase segment was missing the ExpiresAt field.
Now ExpiresAt field is available in the metabase segment
and can be set correctly while executing NewSignerRepairPut.
Change-Id: I068c07492ab27bde2c44477bbd32c5872edd024a
Sometimes we see timeouts from DNS lookups when trying to do
repair GETs. Solution: try using node's last IP and port first.
If we can't connect, retry with DNS lookup.
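A minimal sketch of the fallback order, with simplified inputs (the real code dials over the storj rpc stack rather than plain TCP):

```go
package repairdial

import (
	"context"
	"net"
)

// dialNode tries the node's cached last IP:port first (no DNS lookup
// needed); only if that fails does it dial the hostname address and let
// DNS resolve it.
func dialNode(ctx context.Context, lastIPPort, hostAddress string) (net.Conn, error) {
	var d net.Dialer
	if lastIPPort != "" {
		if conn, err := d.DialContext(ctx, "tcp", lastIPPort); err == nil {
			return conn, nil
		}
	}
	// Fall back to the DNS name if the cached IP no longer works.
	return d.DialContext(ctx, "tcp", hostAddress)
}
```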
Change-Id: I59e223aebb436118779fb18378f6e09d072f12be
the very common case is that the node api version is
indeed at least the requested version, so query for
that first to avoid write traffic.
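A minimal sketch of the read-before-write pattern; the table and column names (node_api_versions, api_version) are assumptions for illustration:

```go
package nodeapiversion

import (
	"context"
	"database/sql"
)

// hasAtLeast answers the common case with a cheap SELECT, so no write
// traffic is generated when the stored version is already new enough;
// only when this returns false does the caller fall back to an UPDATE.
func hasAtLeast(ctx context.Context, db *sql.DB, nodeID []byte, want int) (bool, error) {
	var ok bool
	err := db.QueryRowContext(ctx,
		`SELECT api_version >= $2 FROM node_api_versions WHERE id = $1`,
		nodeID, want).Scan(&ok)
	if err == sql.ErrNoRows {
		return false, nil
	}
	return ok, err
}
```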
Change-Id: Ib047d93078205bc07fee75d1f635503b792307f0
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.
As is to be expected, these values have drifted, and it appears
in a few cases testplanet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.
This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values differ from
dev or release values, and it's harder to forget
to enable features in testplanet.
In terms of reviewing, this change should actually be fairly
easy to review, considering private/testplanet/satellite.go keeps
both the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.
Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
Previously the object range was not used for calculating the order limit.
This meant that even if you were downloading only a small range,
bandwidth was accounted based on the full segment.
This doesn't fully address the accounting, since lazy segment
downloads do not send their requested range or requested limit.
Change-Id: Ic811e570c889be87bac4293547d6537a255078da
project_bandwidth_daily_rollups table
We want to calculate used bandwidth better, so we need to calculate it
from allocated and settled bandwidth. To do this we first need to populate
this new table.
https://storjlabs.atlassian.net/browse/PG-56
Change-Id: I308b737bf08ee48ce4e46a3605697ab2095f7257
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
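A small example with the zeebo/errs package showing the naming convention; the class name already carries the component, so adding "error" to it only creates stutter once classes wrap each other:

```go
package nodestats

import "github.com/zeebo/errs"

// Good: "nodestats" rather than "node stats service error".
var Error = errs.Class("nodestats")

func lookup() error {
	// Wrapped errors read as "nodestats: no rows".
	return Error.Wrap(errs.New("no rows"))
}
```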
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
The previous default FlushBatchSize of 10000 was causing a major
slowdown in select and insert statements on bucket_bandwidth_rollups.
We saw on the saltlake satellite that a FlushBatchSize of 1000 helped
reduce contention and query latency.
Change-Id: Ib95e73482219bc5aedc11925b1849fa5999774ba
Remove the orders Settlement endpoint because it isn't used and it was
already always returning an error.
Change-Id: I81486fbe7044a1444182173bc0693698ee7cfe7e
Delete satellite order methods and DB tables which aren't used anymore
after we refactored the orders to store bucket
information in the orders' encrypted metadata.
There are also configuration parameters and a satellite chore that
aren't needed anymore after the orders refactoring.
Change-Id: Ida3682b95921df70792284b42c96d2508bf8ca9c
CreateGetOrderLimits is not used anymore because we have CreateGetOrderLimits2. We need to remove the old method and fix the name of the second one.
Change-Id: I59148b8d28fc9dbab7d452c884319125a02745d1
SatelliteAddress in OrderLimit is not being used anymore, and some
satellite addresses may consume too many bytes.
Change-Id: Ic7a0efe5b6211c2f3b91af67b293cde98b29d074
Avoid using the project uuid string representation, because
it uses more bandwidth.
This reduces the encrypted metadata size from 118 -> 97 bytes.
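A tiny illustration of where the savings come from: a UUID's textual form is 36 bytes versus 16 bytes raw, which accounts for most of the 118 → 97 byte reduction.

```go
package main

import "fmt"

func main() {
	var raw [16]byte                               // binary UUID: 16 bytes
	text := "8c0816dd-0a5c-41c9-9631-812f6b1b6019" // textual UUID: 36 bytes
	fmt.Println(len(raw), len(text))               // prints: 16 36
}
```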
Change-Id: Ic53a81b83acc065f24f28cd404f9c0b1fe592594