storj

Author	SHA1	Message	Date
Yingrong Zhao	6c34ff64ad	satellite/satellitedb: remove reference to audit information in nodes and audit_history tables This PR removes all code references to the audit_histories table and the `audit_reputation_alpha`, `audit_reputation_beta`, `unknown_audit_reputation_alpha`, and `unknown_audit_reputation_beta` columns from the nodes table. It also drops the audit_histories table from the db, since the code that references it is not currently being used. Change-Id: Ifcda8db36afb3a333d487ff831f2fdefc8b02a4c	2021-08-13 21:11:28 +00:00
Yingrong Zhao	e4cc965c39	satellite/satellitedb: replace explicit transaction with dbx query for UpdateReputation Change-Id: I7c139ededea83d4b58107536c3a031c4f92d6eb4	2021-08-05 17:09:49 +00:00
Egon Elbre	65804801ec	all: fix mon.Task leak Change-Id: Ifd58c7ac5631b9c3c750b3f4cc50525167e90709	2021-08-05 14:07:45 +03:00
Yingrong Zhao	646ce5b8cc	satellite/overlay: remove reputation logic from overlay Change-Id: I3492860e4537c7a8e4e824ec4c9c8d179134a0c0	2021-07-28 15:15:28 -04:00
Yingrong Zhao	f8914ccce0	satellite/{repair, overlay}: use reputation store in repair Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a	2021-07-28 13:22:05 -04:00
Yingrong Zhao	e91574cee1	satellite/{reputation, gracefulexit}: use reputation store in gracefulexit With the effort to move audit related data into reputation store, this PR updates gracefulexit endpoint to use reputation service to get a node's audit score Change-Id: Iad93ea689ad67ff9c57c7be16687e21e715fab7a	2021-07-28 13:21:41 -04:00
Yingrong Zhao	6c7bf357cd	satellite/{reputation,audit,overlay}: replace overlay with reputation package in audit This PR implements the reputation store and replaces overlay in the audit service so that this store holds each node's audit stats. In order to keep the changeset smaller, most of the changes in this PR copy audit logic from overlay to the reputation package. In a following PR, the duplicated code will be removed from overlay. Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1	2021-07-28 13:10:48 -04:00
Cameron Ayer	8c124c6fa4	satellite/{reputation,overlay,satellitedb}: create reputation service, DB, add overlay method UpdateReputation Define service and DB interface for storing node reputation data and updating the overlay cache. Add overlay service and DB method UpdateReputation. See https://github.com/storj/storj/pull/4144 Change-Id: Iedd8bd3274457d26c595919303d55327c1464b8c	2021-06-24 16:19:15 +00:00
Cameron Ayer	53322bb0a7	satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats Until now, whenever audits were recorded we would try to delete the node from containment just in case it exists. Since we now want to treat segment repair downloads as audits, this would erroneously remove nodes from containment, as repair does not go through a Reverify step. With this changeset, (Batch)UpdateStats will not remove nodes from containment. The Reverify method will remove all necessary nodes from containment. Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f	2021-06-01 21:02:44 +00:00
Egon Elbre	cdcc67207c	satellite/satellitedb: fix nil panic in UpdateCheckIn Change-Id: If6ae2c3d9b7c269b0a9d652e68854091f668b5ec	2021-05-25 00:30:36 +03:00
Egon Elbre	8f15f975a2	satellite/overlay: improve contended update checkin Improve UpdateCheckIn on a contended row: name old time/op new time/op delta UpdateCheckInContended-100x-32 2.29s ±55% 0.17s ±61% -92.45% (p=0.008 n=5+5) Change-Id: I053ab9f1cff136c306e5fb57f5e355cdc0269a8c	2021-05-16 20:41:12 +03:00
Egon Elbre	0858c3797a	satellite/{metabase,satellitedb}: deduplicate AS OF SYSTEM TIME code Currently we were duplicating code for AS OF SYSTEM TIME in several places. This replaces the code with using a method on dbutil.Implementation. As a consequence it's more useful to use a shorter name for implementation - 'impl' should be sufficiently clear in the context. Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish between the two modes is useful and slightly shorter without causing confusion. Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d	2021-05-11 12:40:36 +03:00
Cameron Ayer	bb343d9028	satellite/satellitedb: don't remove offline nodes from containment When audits are being recorded, we automatically add some SQL to remove the node from the pending audits table in case it exists. They are removed from pending audits even if the node was offline for the audit. This is not the correct behavior. Add statement to record audit results in reverify tests to ensure no more false positives. Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17	2021-05-03 16:05:55 +00:00
Egon Elbre	961e841bd7	all: fix error naming errs.Class should not contain "error" in the name, since that causes a lot of stutter in the error logs. As an example a log line could end up looking like: ERROR node stats service error: satellitedbs error: node stats database error: no rows Whereas something like: ERROR nodestats service: satellitedbs: nodestatsdb: no rows Would contain all the necessary information without the stutter. Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0	2021-04-29 15:38:21 +03:00
Egon Elbre	a2e20c93ae	private/dbutil: use dbutil and tagsql from storj.io/private Initially we duplicated the code to avoid large scale changes to the packages. Now we are past metainfo refactor we can remove the duplication. Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef	2021-04-23 14:36:52 +03:00
Cameron Ayer	a0c5da6643	satellite/satellitedb: in stray nodes DQ, don't DQ nodes where last_contact_success = '0001-01-01 00:00:00+00' When nodes check in for the very first time, if the satellite can't ping them back, they are inserted into the nodes table with last_contact_success of '0001-01-01 00:00:00+00'. If the stray nodes chore runs before the node can fix their problem, they are DQd. Solution: when DQing stray nodes, don't DQ where last_contact_success = '0001-01-01 00:00:00+00'::timestamptz Change-Id: I477a02d5ef85b2c930ed6b7d99a4d1995169bca8	2021-04-22 10:13:13 -04:00
Jeff Wendling	a65aecfd98	compensation: always generate invoices for every node instead of only generating invoices for nodes that had some activity, we generate it for every node so that we can find and pay terminal nodes that did not meet thresholds before we recognized them as terminal. Change-Id: Ibb3433e1b35f1ddcfbe292c034238c9fa1b66c44	2021-03-29 14:15:45 +00:00
Cameron Ayer	05f8d2d0b1	satellite/satellitedb: filter offline suspended nodes from selection Change-Id: I5a6f413453332238d579a7bf50eb30e9156f96c2	2021-03-27 23:36:46 +00:00
Cameron Ayer	1a51049ac0	satellite/{overlay,satellitedb}: add flag to toggle suspending nodes for offline audits This change introduces a new config flag, --overlay.audit-history.offline-suspension-enabled, to toggle suspending nodes for offline audits. If the flag is set to true, nodes will be suspended if they meet the requirements. If the flag is false, nodes will not be suspended. If they are already suspended and/or under review, these will be cleared. Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74	2021-03-27 16:28:27 +00:00
Cameron Ayer	eb44dc21b4	satellite/satellitedb: select stray nodes for DQ in separate tx from update Previously we would select a limited number of nodes for DQ in a CTE and run the update on that set in a single transaction. This could lead to locking on the table, so instead we select and update in separate transactions. Change-Id: I1e802c0845e829eeadcee4fa382f58462515fdb1	2021-03-27 00:00:23 +00:00
Cameron Ayer	2607b16070	satellite/{overlay/straynodes,satellitedb}: rework DQNodesLastSeenBefore to return DQd node IDs and last contact successes We would like to log Node IDs and last contact successes of nodes DQd in this manner. We would also like to avoid returning an unbounded list of items from the db. Therefore we change the query to select a limited number of nodes that meet the DQ conditions and iterate until 0 rows are returned. Each column of the query is already indexed. Change-Id: Iaec2d9b56e7202b7c2028ba21750d40c8dd506ee	2021-03-22 13:01:30 -04:00
Cameron Ayer	aeac6264cd	satellite/satellitedb: add metric stray_nodes_dq_count Add metric so we can see how many nodes are DQd due to this. Change-Id: Ie4bdd1375fb9bd948af14fed9a2962b783b6a526	2021-03-01 21:06:36 +00:00
Cameron Ayer	549033f2e6	satellite/satellitedb: don't include DQd and exited nodes in DQStrayNodes Don't update DQ time of already DQd nodes. Don't DQ nodes who exited. Change-Id: I4528a9ba9f8e278987165ad337a9b34dadb9788b	2021-02-19 15:12:30 -05:00
JT Olio	b2ed7edd30	cmd/satellite: restore-trash parallel workers Change-Id: Ic7466b21c20bda334e7ba4268a494e96b6528ac1	2021-02-18 19:11:19 +02:00
JT Olio	3ae3389ddc	cmd/satellite: restore-trash command Change-Id: I80fc932c12147692d49cde277784871ac611fcad	2021-02-18 09:19:22 -07:00
Yaroslav Vorobiov	966535e9de	{storagenode,satellite}/nodeoperator: add wallet features Change-Id: Iac7eb40a52b8fddcc573aebaad2e3a30a10cded9	2021-02-08 22:09:45 +02:00
Cameron Ayer	a17934cb51	satellite/satellitedb: remove reference to uptime counts Change-Id: I26ac540b720a8ba5d6ca44526900228352dcaf4e	2021-02-02 14:51:27 -05:00
Egon Elbre	54e01d37f9	satellite/overlay: add DownloadSelectionCache Change-Id: Ic0779280172325f8d03f55a2e9673722f72bdd44	2021-01-29 16:47:06 +02:00
Cameron Ayer	d14607a5f7	satellite/{contact,nodestats,overlay,satellitedb}: remove references to total_uptime_count and uptime_success_count columns Change-Id: I1f92022909bc564e9b1e31bf937fdfe7c16554de	2021-01-19 15:43:02 -05:00
Cameron Ayer	75d828200c	private,satellite: add chore to dq stray nodes Full scope: private/testplanet,satellite/{overlay,satellitedb} Description: In most cases, downtime tracking with audits will eventually lead to DQ for nodes who are unresponsive. However, if a stray node has no pieces, it will not be audited and will thus never be disqualified. This chore will check for nodes who have not successfully been contacted in some set time and DQ them. There are some new flags for toggling DQ of stray nodes and the timeframes for running the chore and how long nodes can go without contact. Change-Id: Ic9d41fdbf214736798925e728245180fb3c55615	2021-01-19 14:21:56 -05:00
Cameron Ayer	0403e99a5b	satellite/{overlay,satellitedb}: remove unused methods for old downtime tracking GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods which were only used by the previous downtime tracking system which has been removed. These methods should also be removed. Change-Id: Idb829d742e1f987e095604423fff656fe581183e	2021-01-11 15:21:28 +00:00
Moby von Briesen	6e2ef3b9ee	Revert "satellite/satellitedb: Do not consider nodes with offline_suspended as reputable." This reverts commit `e24262c2c9`. Change-Id: I287deb2e52d03bbd698ed055f0f216b0b5bf2798	2021-01-04 14:28:37 +00:00
Moby von Briesen	825dc71227	satellite/{overlay, satellitedb}: Refactor audit history * Separate audit history interface into its own file in the overlay package * Add overlay.AuditHistory struct so that internalpb.AuditHistory is only used from within the database layer * Add overlay.GetAuditHistory function for features that will require access to detailed audit history information * Do not return full audit history from UpdateAuditHistory - callers to that function only need to know the online score and whether a full tracking period has been completed * Move audit history tests out of satellite/satellitedb, since they are independent of database implementation Change-Id: I35b0c4ac23bbaabd80624f8a9631c3cb1a1f33bd	2020-12-29 18:50:22 +00:00
Moby von Briesen	e24262c2c9	satellite/satellitedb: Do not consider nodes with offline_suspended as reputable. Nodes which are offline_suspended will no longer be considered for new uploads. The current threshold that enters a node into offline suspension is 0.6. Disqualification for offline suspension is still disabled. Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a	2020-12-29 17:59:09 +00:00
Ethan Adams	6070018021	satellite/overlay: use AS OF SYSTEM TIME with Cockroach Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads have retry logic, so it isn't necessary for the nodes returned to be current.	2020-12-22 21:07:07 +02:00
Ethan	5dc013d3bd	satellite/overlay: Add retry to all selects in overlaycache Change-Id: I0356d71a35701f8e0ca04a34b2bb2aea666c1394	2020-11-29 16:46:57 -05:00
JT Olio	0ba516d405	satellite: support pointing db components at different databases the immediate need is to be able to move the repair queue back out of cockroach if we can't save it. Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326	2020-11-28 18:39:16 +00:00
Cameron Ayer	dc67ce74c9	satellite: remove IsUp field from overlay.UpdateRequest With the new overlay.AuditOutcome type for offline audits, the IsUp field is redundant. If AuditOutcome != AuditOffline, then the node is online. In addition to removing the field itself, other changes needed to be made regarding the relationship between 'uptime' and 'audits'. Previously, uptime and audit outcome were completely separated. For example, it was possible to update a node's stats to give it a successful/failed/unknown audit while simultaneously indicating that the node was offline by setting IsUp to false. This is no longer possible under this changeset. Some tests which did this have been changed slightly in order to pass. Also add new benchmarks for UpdateStats and BatchUpdateStats with different audit outcomes. Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b	2020-11-02 15:34:17 -05:00
Egon Elbre	11338e9beb	satellite/internalpb: move audithistory.pb Change-Id: I8eee84d49ed90459168ddaf04ae57f790c2a22c4	2020-10-30 15:30:11 +02:00
Cameron Ayer	bb7be23115	satellite/{audit,overlay,satellitedb}: enable reporting offline audits - Remove flag for switching off offline audit reporting. - Change the overlay method used from UpdateUptime to BatchUpdateStats, as this is where the new online scoring is done. - Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same method to record offline audits as success, failure, and unknown, we need to distinguish offline audits from the rest. Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355	2020-10-27 10:44:46 +00:00
Moby von Briesen	7c3afe164b	satellite/overlay: uncomment dq for offline and disable with feature flag Change-Id: Ib39e2be32e880b822a94eddfb81af99a38843a27	2020-10-16 12:55:16 +00:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Cameron Ayer	b39a99bae6	satellite/{overlay,satellitedb}: always show node's real online score Previously if a node did not have audit history data for each of the windows over the tracking period, we would give them the benefit of the doubt and set their score to 1. This was to prevent nodes from being suspended right out of the gate. We need a minimum amount of data to evaluate them. However, a node who is actually failing at being online will have no idea until they have received enough audits and we suspend them. Instead, we will always use their real score, but use a flag to determine whether they are eligible for suspension/dq. Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf	2020-10-02 12:28:11 -04:00
Jennifer Johnson	4e2413a99d	satellite/satellitedb: uses vetted_at field to select for reputable nodes Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1. This is because many tests relied on selecting nodes that were reputable based on audit and uptime counts of 0, in effect, selecting new nodes as reputable ones. However, since reputation is now indicated by a vetted_at db field that is explicitly set rather than implied by audit and uptime counts, it would be more complicated to try to update all of the nodes' reputations before selecting nodes for tests. Now we just allow all test nodes to be new if needed. Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d	2020-09-04 16:45:32 +00:00
Moby von Briesen	2d01dd9732	satellite/satellitedb: Add online_score column to nodes table Add online score used for the new audit history offline tracking system to the nodes table. This allows us easy access to the node's online score for the storagenode dashboard as well as for data analysis. Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410	2020-08-31 15:07:07 +00:00
Moby von Briesen	60a95d0dc9	satellite/{satellitedb,overlay}: Enable offline suspension and review period When a node's audit history "online score" passes below a configured threshold, the node goes into "offline suspension" mode and begins a review period, where the operator is given an opportunity to bring their node back online. After the review period passes, offline suspension is turned off for the node. In the future, if a node still has a bad online score at the end of the review period, it will be disqualified. This is disabled right now. In the future, if a node is in offline suspension, it will be treated as "unhealthy". Right now, there are no consequences for being in offline suspension. Minor changes: * Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and into UpdateRequest. * Adds "now" argument to UpdateStats/BatchUpdateStats args for easy testing. * Changes formatting strings inside buildUpdateStatement to use specific types. Change-Id: I032b60298840fc16e6ef831da750f2d57619a397	2020-08-28 16:35:48 +00:00
Moby von Briesen	959cd5cd83	satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7	2020-08-20 22:46:28 +00:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
stefanbenten	257855b5de	all: replace == comparison with errors.Is Change-Id: I05d9a369c7c6f144b94a4c524e8aea18eb9cb714	2020-07-14 15:50:25 +00:00
paul cannon	bbdb351e5e	all: use jackc/pgx in place of lib/pq What: Use the github.com/jackc/pgx postgresql driver in place of github.com/lib/pq. Why: github.com/lib/pq has some problems with error handling and context cancellations (i.e. it might even issue queries or DML statements more than once! see https://github.com/lib/pq/issues/939). The github.com/jackc/pgx library appears not to have these problems, and also appears to be better engineered and implemented (in particular, it doesn't use "exceptions by panic"). It should also give us some performance improvements in some cases, and even more so if we can use it directly instead of going through the database/sql layer. Change-Id: Ia696d220f340a097dee9550a312d37de14ed2044	2020-07-13 15:54:41 +00:00


184 Commits