storj

Author	SHA1	Message	Date
Egon Elbre	59e3b586e7	satellite/{gracefulexit,overlay}: enable as of system time queries Change-Id: I2af5eb0e8a51fca7893ce07b78b5633be71dfef8	2021-06-22 11:50:50 +00:00
JT Olio	da9ca0c650	testplanet/satellite: reduce the number of places default values need to be configured Satellites set their configuration values to default values using cfgstruct, however, it turns out our tests don't test these values at all! Instead, they have a completely separate definition system that is easy to forget about. As is to be expected, these values have drifted, and it appears in a few cases test planet is testing unreasonable values that we won't see in production, or perhaps worse, features enabled in production were missed and weren't enabled in testplanet. This change makes it so all values are configured the same, systematic way, so it's easy to see when test values are different than dev values or release values, and it's less hard to forget to enable features in testplanet. In terms of reviewing, this change should be actually fairly easy to review, considering private/testplanet/satellite.go keeps the current config system and the new one and confirms that they result in identical configurations, so you can be certain that nothing was missed and the config is all correct. You can also check the config lock to see what actual config values changed. Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e	2021-06-01 22:14:17 +00:00
Ivan Fraixedes	7fb86617fc	satellite/satellitedb: Use CRDB AS OF SYSTEM & batch for GE Use the 'AS OF SYSTEM TIME' Cockroach DB clause for the Graceful Exit (a.k.a GE) queries that count the delete the GE queue items of nodes which have already exited the network. Split the subquery used for deleting all the transfer queue items of nodes which has exited when CRDB is used and batch the queries because CRDB struggles when executing in a single query unlike Postgres. The new test which has been added to this commit to verify the CRDB batch logic for deleting all the transfer queue items of the exited nodes has raised that the Enqueue method has to run in baches when CRDB is used otherwise CRDB has return the error "driver: bad connection" when a big a amount of items are passed to be enqueued. This error didn't happen with the current test implementation it was with an initial one that it was creating a big amount of exited nodes and transfer queue items for those nodes. Change-Id: I6a099cdbc515a240596bc93141fea3182c2e50a9	2021-05-07 13:09:19 -04:00
Michal Niewrzal	b3aa28cc02	satellite/gracefulexit: migrate to metabase Change-Id: I8be9cc68894124427e4a30d7631126b3afb1f281	2020-12-18 10:57:39 +00:00
Qweder93	88ff8829a1	satellite/gracefulexit: RecvTimeout increased to 2h, so slow nodes stop receiving lot of fails and as a result DQ Change-Id: Id4c8a394162ba368aeb573a927f825bf7250aa52	2020-08-24 18:59:24 +03:00
Egon Elbre	94a09ce20b	all: add missing dots Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a	2020-08-11 17:50:01 +03:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
littleskunk	76849558cb	satellite/gracefulexit: increase performance and tolerate higher error rate Graceful exit is very slow at the moment. Over the last couple days we increase the batch size on Stefans satellite to 1000 but as a side effect the error rate was increased. With a batch size of 500 the error rate looks stable. This PR will increase the default to batch size to 300. Graceful exit will still be painful slow but at least it will be a bit faster. At the same time this PR also increases the number of errors we tolerate. We don't want to DQ slow storage nodes just because they didn't finish all 300 transfers in time. We want to give them more retries. Change-Id: I92e3f99e116d4988457d8b902a88e85ed1bcc1a7	2020-02-12 11:40:15 +00:00
Jeff Wendling	7999d24f81	all: use monkit v3 this commit updates our monkit dependency to the v3 version where it outputs in an influx style. this makes discovery much easier as many tools are built to look at it this way. graphite and rothko will suffer some due to no longer being a tree based on dots. hopefully time will exist to update rothko to index based on the new metric format. it adds an influx output for the statreceiver so that we can write to influxdb v1 or v2 directly. Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff	2020-02-05 23:53:17 +00:00
Natalie Ventura Villasana	aa3e183c2e	satellite/gracefulexit: add ge eligibility check Adds check to see if storage nodes are eligible to initiate graceful exit, by checking their CreatedAt date and seeing if their "age" is greater than the new config value: NodeMinAgeInMonths The default for this value is 6 months for now. https://storjlabs.atlassian.net/browse/V3-3357 Change-Id: Ib807ab8987ddb5a38a27a83886490f73fe8c5816	2019-12-31 09:31:58 -05:00
littleskunk	6ab72a6e79	satellite/gracefulexit: enable graceful exit in production Change-Id: I526ce4a4de9c318f1333b793e3167f5f86d65adc	2019-12-09 17:32:34 +00:00
Natalie Villasana	1a9757a7f2	satellite/gracefulexit: add count for order limits sent from satellite to exiting node (#3544 )	2019-11-13 09:54:50 -05:00
Yingrong Zhao	69b0ae02bf	satellite/gracefulexit: separate functional code in endpoint (#3476 )	2019-11-08 13:57:51 -05:00
Maximillian von Briesen	590312970d	satellite/gracefulexit: add flag for enabling/disabling graceful exit on the satellite (#3437 )	2019-11-01 16:21:24 +02:00
Natalie Villasana	4878135068	satellite/gracefulexit, storagenode/gracefulexit: add timeouts (#3407 )	2019-10-30 13:40:57 -04:00
Yingrong Zhao	fa1ac24e19	satellite/gracefulexit: add failure threshold check (#3329 ) * add overall failure percentage check and inactive time frame check before sending a response to sno * update comment * delete node from transfer queue if it has been inactive for too long * fix linting error * add test config value * fix nil pointer * add config value into testplanet * add unit test for overall failure threshold * move timeframe threshold to chore * update protolock * add chore test * add per peiece failure count logic * change config name from EndpointMaxFailures to MaxFailuresPerPiece * address comments * fix linting error * add error handling for no row returned from progress table * fix test for graceful exit chore on storagenode * fix typo InActive -> Inactive * improve readability for failure threshold calculation * update config lock * change error handling for GetProgress in graceful exit endpoint on the satellite side * return proper rpc error in endpoint * add check in chore test for checking finish timestamp and queue	2019-10-24 12:24:42 -04:00
Maximillian von Briesen	abb567f6ae	cmd/satellite: add graceful exit reports command to satellite CLI (#3300 ) * update lock file and add comment * add created at and bytes transferred * cleanup * rename db func to GetGracefulExitNodesByTimeFrame * fix flag * split into two overlay functions * := to = * fix test * add node not found error class * fix overlay test * suggested test changes * review suggestions * get exit status from overlay.Get() * check rows.Err * fix panic when ExitFinishedAt is nil * fix comments in cmdGracefulExit	2019-10-22 21:06:01 -04:00
Ethan Adams	a1275746b4	satellite/gracefulexit: Implement the 'process' endpoint on the satellite (#3223 )	2019-10-11 17:18:05 -04:00

18 Commits