the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.
Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
We did not test the SegmentHealth function with actual production
values, and it turns out that values such as 52 healthy, 35 minimum
result in +Inf segment health - so pretty much all segments put into the
repair queue have the same health, which means we effectively aren't
sorting by health.
This change inserts numHealthy as segment health into the database so
the segments are ordered as they were before. We need to refine the
SegmentHealth function before we can support multi RS.
Change-Id: Ief19bbfee3594c5dfe94ca606bc930f05f85ff74
Because the PieceTracker receives a piece count per nodes which is an
approximation of the number of nodes that they are going to be reported
by the metainfo loop so we can use as a good guess of the map's size and
initialized with it.
Change-Id: I644db40926c03e4c457457fb41d2ec1da059cea6
MetadataSize can slightly vary and checking for exact value makes
difficult to change what's being encoded in metadata.
Change-Id: I5f1ade41bc26d115e6743367ee35cf1ba74795c9
We now have the piece hashes verified for all segments on all production
satellites. We can remove the code that handles the case where piece
hashes are not verified. This would make easier the migration of
services from PointerDB to the new metabase.
For consistency, PieceHashesVerified is still set to true in PointerDB
for new segments.
Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183
Currently flag parsing seems to call Set twice, which causes problems
with encryption keys. We can clear for every set for now.
Change-Id: Id5c695b4020194ac1c50a2da9c7d2a896cb9216f
Rather than having a single repair override value, we will now support
repair override values based on a particular segment's RS scheme.
The new format for RS override values is
"k/o/n-override,k/o/n-override..."
Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours. This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.
Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.
A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.
Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
Before manipulating order information on storagenodes we need to wait
for the orders to propagate to the database. Some of that happens
async with uplink.
Change-Id: Iaacfd7db0909ab5d2831d06388e5fb27b6d4778f
Firstly, this changes the repair functionality to return Canceled errors
when a repair is canceled during the Get phase. Previously, because we
do not track individual errors per piece, this would just show up as a
failure to download enough pieces to repair the segment, which would
cause the segment to be added to the IrreparableDB, which is entirely
unhelpful.
Then, ignore Canceled errors in the return value of the repair worker.
Apparently, when the worker returns an error, that makes Cobra exit the
program with a nonzero exit code, which causes some piece of our
deployment automation to freak out and page people. And when we ask the
repair worker to shut down, "canceled" errors are what we _expect_, not
an error case.
Change-Id: Ia3eb1c60a8d6ec5d09e7cef55dea523be28e8435
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.
This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.
These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.
Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.
Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
This PR does the following three things:
1. Defines a high-level interface for this wasm package
- All return value from this package will be wrapped with an
result object that contains a value field and an error field
2. Exposes two new functions to allow users to add permissions for a
given API key
- newPermission()
- setAPIKeyPermission()
3. Adds API documentation for the newly added API functions
Change-Id: Id995189702b369bba18fa344bef4ddfb0f3f1f44
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.
RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).
Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
A change was made to use a metabase.SegmentKey (a byte slice alias)
as the last seen item to iterate through the irreparable DB in a
for loop. However, this SegmentKey was not initialized, thus it was
nil. This caused the DB query to return nothing, and healthy segments
could not be cleaned out of the irreparable DB.
Change-Id: Idb30d6fef6113a30a27158d548f62c7443e65a81
WHAT:
change user's email endpoint and appropriate service method was implemented
WHY:
make it possible to change user's email for temporary filezilla account
Change-Id: Ieea41bf49819a42b5f433e8dfaeec24c6d5ddc9f
Storage nodes undergoing Graceful Exit have up to now been receiving
hostnames for all other storage nodes they need to contact when
transferring pieces. This adds up to a lot of DNS lookups, which
apparently overwhelm some home routers. There does not seem to be any
need for us to send hostnames for graceful exit as opposed to IP
addresses; we already use IP addresses (as given by the last_ip_port
column in the nodes table) for all the GET and PUT orders we send out.
This change causes IP addresses to be used instead.
I started trying to construct a test to ensure that the behavior
changed, but it was rabbit-holing, so I've begun to feel that maybe this
change doesn't require one; it is a very simple change, and very much of
the same nature as what we already do for IPs in CreateGetOrderLimits
and CreatePutOrderLimits (and others).
Change-Id: Ib2b5ffe7a9310e9cdbe7464450cc7c934fa229a1
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.
In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.
Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.
Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
After moving SatStreamID and SatSegmentID from common I missed changing
some methods in metainfo endpoint. This change is a fix for that.
Change-Id: I34e121fce47371ee4cfd92cce03809520b68859f
We have some types that are only valid for satellite usage. Such types
are SatStreamID and SatSegmentID. This change moves those types to
storj/storj and adds basic infrastructure for generating code.
Change-Id: I1e643844f947ce06b13e51ff16b7e671267cea64
Some of metainfo endpoint methods are not used but we still have
implementation there. This change removes unused code and returns
unimplemented error for those methods.
Change-Id: I74e75e0caff76a4f5d119ee989b687b4e9d6e6f9
This change removed unused 'createRequests' struct. As far I remember it
was used to help validating old metainfo beginObject/commitObject flow.
Change-Id: I0f139b9934196d73f26eafa347ba5605722f3a55
As part of the Metainfo Refactoring, we need to make the Metainfo Loop
working with both the current PointerDB and the new Metabase. Thus, the
Metainfo Loop should pass to the Observer interface more specific Object
and Segment types instead of pb.Pointer.
After this change, there are still a couple of use cases that require
access to the pb.Pointer (hence we have it as a field in the
metainfo.Segment type):
1. Expired Deletion Service
2. Repair Service
It would require additional refactoring in these two services before we
are able to clean this.
Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.
Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355