storj

Author	SHA1	Message	Date
Kaloyan Raev	1f386db566	cmd/satellite: remove metainfo commands (#3955)	2020-10-22 13:33:09 +03:00
Kaloyan Raev	1aeb14e65e	satellite/audit: do not delete expired segments A year ago we made the audit service delete expired segments. Meanwhile, we introduced an expired deletion sub-service in the metainfo service whose sole purpose is deleting expired segments. Therefore, we are now removing this responsibility from the audit service. It will continue to avoid reporting failures on expired segments, but it will not delete them anymore. We do this to clean up responsibilities in advance of the metainfo refactoring. Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4	2020-10-22 08:24:16 +00:00
paul cannon	360ab17869	satellite/audit: use LastIPAndPort preferentially This preserves the last_ip_and_port field from node lookups through CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later calls to (Verifier).GetShare() can try to use that IP and port. If a connection to the given IP and port cannot be made, or the connection cannot be verified and secured with the target node identity, an attempt is made to connect to the original node address instead. A similar change is not necessary to the other CreateOrderLimits functions, because they already replace node addresses with the cached IP and port as appropriate. We might want to consider making a similar change to CreateGetRepairOrderLimits(), though. The audit situation is unique because the ramifications are especially powerful when we get the address wrong. Failing a single audit can have a heavy cost to a storage node. We need to make extra effort in order to avoid imposing that cost unfairly. Situation 1: If an audit fails because the repair worker failed to make a DNS query (which might well be the fault on the satellite side), and we have last_ip_and_port information available for the target node, it would be unfair not to try connecting to that last_ip_and_port address. Situation 2: If a node has changed addresses recently and the operator correctly changed its DNS entry, but we don't bother querying DNS, it would be unfair to penalize the node for our failure to connect to it. So the audit worker must try both last_ip_and_port _and_ the node address as supplied by the SNO. We elect here to try last_ip_and_port first, on the grounds that (a) it is expected to work in the large majority of cases, (b) there should not be any security concerns with connecting to an out-of-date address, and (c) avoiding DNS queries on the satellite side helps alleviate satellite operational load. Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200	2020-10-21 13:34:40 +00:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Kaloyan Raev	e7f2ec7ddf	satellite/audit: fix sanity check for verify-piece-hashes command The VerifyPieceHashes method has a sanity check for the number of pieces to be removed from the pointer after the audit for verifying the piece hashes. This sanity check failed when we executed the command on the production satellites because the Verify command removes Fails and PendingAudits nodes from the audit report if piece_hashes_verified = false. A new temporary UsedToVerifyPieceHashes flag is added to audits.Verifier. It is set to true only by the verify-piece-hashes command. If the flag is true then the Verify method will always include Fails and PendingAudits nodes in the report. Test case is added to cover this use case. Change-Id: I2c7cb6b12029d52b2fc565365eee0826c3de6ee8	2020-10-07 17:17:48 +03:00
Kaloyan Raev	b409b53f7f	cmd/satellite: command for verifying piece hashes Jira: https://storjlabs.atlassian.net/browse/PG-69 There are a number of segments (with piece_hashes_verified = false in their metadata) on US-Central-1, Europe-West-1, and Asia-East-1 satellites. Most probably, this happened due to a bug we had in the past. We want to verify them before executing the main migration to metabase. This would simplify the main migration to metabase with one less issue to think about. Change-Id: I8831af1a254c560d45bb87d7104e49abd8242236	2020-09-29 10:58:24 +00:00
Michal Niewrzal	aa47e70f03	satellite/metainfo: use metabase.SegmentKey with metainfo.Service Instead of using string or []byte we will be using dedicated type SegmentKey. Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220	2020-09-03 15:11:32 +00:00
Egon Elbre	3ca405aa97	satellite/orders: use metabase types as arguments Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef	2020-08-28 15:52:37 +03:00
Moby von Briesen	4f28bf0720	satellite/audit: Do not return errors from Verify or Reverify on segment modified, expired, or deleted If a segment is deleted, is modified, or expires during an audit, this is not problematic, so we should not return errors. Functionally, nothing changes, but our metrics around audit success rate will be improved after this change. Change-Id: Ic11df056b2c73894b67a55894bd4d58c00470606	2020-08-26 13:24:00 +00:00
Qweder93	01bb2bd17d	satellite/audit: verifier checks if node made successful GE before auditing Change-Id: Ia6cde4e9fcf11020a5301d38065f7159f276eb80	2020-08-17 23:37:57 +03:00
Egon Elbre	94a09ce20b	all: add missing dots Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a	2020-08-11 17:50:01 +03:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
Egon Elbre	ed627144ed	all: use DialNodeURL throughout the codebase Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c	2020-05-20 10:36:30 +00:00
paul cannon	0c8c11b251	satellite/audit: add not_enough_shares_for_audit counter We have been using the SQL expression `name='(*Verifier).Verify' AND error_name='not enough shares for successful audit'` thus far to detect cases of this problem and alert on them. Unfortunately, since this rarely (hopefully never) happens, influxdb has no data for most of the auditor instances, and when it has no data for a time series, it returns no columns either. This makes Redash upset when it tries to perform a query for an alert and can't find the column whose value it expects to check. This change should make it so zero values are reported when the problem has not happened, and higher values when it has. Change-Id: I79e5e000f879678b661dac88caae1e2915b39ab1	2020-04-03 17:00:50 +00:00
littleskunk	23e5a0471f	satellite/audit: clean up logging (#3832) Co-authored-by: Ivan Fraixedes <ivan@fraixed.es>	2020-03-30 12:09:50 -06:00
Bill Thorp	94c11c5212	satellite: remove some unnecessary UTC() calls Fixes some easy cases of extraneous UTC() calls Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a	2020-03-13 13:49:44 +00:00
Jennifer Johnson	0d60c1a4b2	satellite/audit: fix checkSegmentAltered to detect segments that have changed during an audit - Previously, checkSegmentAltered only checked for segments that were replaced but we want to detect all changes to a segment that occurred while an audit was being conducted. - Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified segments were not being removed from containment mode. - Filled in gaps in reverify testing to ensure nodes are properly removed from containment. Change-Id: Icd96d369278987200fd28581395725438972b292	2020-03-05 19:05:39 +00:00
Moby von Briesen	6043d01c90	satellite/audit/verifier: add metric for number of successfully downloaded shares Change-Id: Ia4f1dc6e088db802e340aaecf80cc7ef6dc237a4	2020-02-27 14:33:59 +00:00
Egon Elbre	5342dd9fe6	go.mod: update uplink Change-Id: I867a6a1eef8aa5d60bb676e5112b98c4192ce811	2020-02-21 16:08:12 +02:00
Jeff Wendling	7999d24f81	all: use monkit v3 this commit updates our monkit dependency to the v3 version where it outputs in an influx style. this makes discovery much easier as many tools are built to look at it this way. graphite and rothko will suffer some due to no longer being a tree based on dots. hopefully time will exist to update rothko to index based on the new metric format. it adds an influx output for the statreceiver so that we can write to influxdb v1 or v2 directly. Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff	2020-02-05 23:53:17 +00:00
Egon Elbre	082ec81714	uplink: move to storj.io/uplink (#3746)	2020-01-08 15:40:19 +02:00
Egon Elbre	6615ecc9b6	common: separate repository Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a	2019-12-27 14:11:15 +02:00
Moby von Briesen	ab777e823e	do not update pointer for failed audits Change-Id: If88dce8928db28d6f53c3dc771e14ea97aae9661	2019-12-16 10:50:54 -05:00
Egon Elbre	72d407559e	satellite/metainfo: don't leak error implementation detail (#3722) * satellite/metainfo: don't leak implementation detail * add missing wrap	2019-12-10 15:21:30 -05:00
Maximillian von Briesen	8653dda2b1	satellite/audit: do not contain nodes for unknown errors (#3592) * skip unknown errors (wip) * add tests to make sure nodes that time out are added to containment * add bad blobs store * call "Skipped" "Unknown" * add tests to ensure unknown errors do not trigger containment * add monkit stats to lockfile * typo * add periods to end of bad blobs comments	2019-11-19 17:30:28 +01:00
Egon Elbre	ee6c1cac8a	private: rename internal to private (#3573)	2019-11-14 21:46:15 +02:00
Egon Elbre	cc032d3151	satellite/metainfo: fix some uses of metainfo.Delete (#3513) * satellite/metainfo: rename Delete to UnsynchronizedDelete * fix deletes * make db private * fix typos * also verify on commit object	2019-11-06 18:02:14 +01:00
Maximillian von Briesen	7cdc1b351a	satellite/audit: do not audit expired segments (#3497) * during audit Verify, return error and delete segment if segment is expired * delete "main" reverify segment and return error if expired * delete contained nodes and pointers when pointers to audit are expired * update testplanet.Upload and testplanet.UploadWithConfig to use an expiration time of an hour from now * Revert "update testplanet.Upload and testplanet.UploadWithConfig to use an expiration time of an hour from now" This reverts commit e9066151cf84afbff0929a6007e641711a56b6e5. * do not count ExpirationDate=time.Time{} as expired	2019-11-05 20:41:48 +01:00
littleskunk	eeb38245ff	satellite/audit: improve logging (#3285)	2019-10-16 13:48:05 +02:00
JT Olio	a5d1776539	audits: missing continue Change-Id: Ifcac8e61ebd8c59407e01c791adc60d9f88ff1b7	2019-10-11 13:55:27 -06:00
Maximillian von Briesen	784ca1582a	satellite/audit: fix audit panic (#3217)	2019-10-09 10:06:58 -04:00
Maximillian von Briesen	3a3d576d9b	satellite/audit: add mutex to pieceHashesVerified map (#3214) * add mutex * remove double send to ch * lock mutex inside defer * import sync	2019-10-08 17:01:32 -04:00
littleskunk	c009543236	satellite/audit: Add piece hash verified to log messages (#3204)	2019-10-08 12:51:57 +02:00
Maximillian von Briesen	e1b7d01160	satellite/audit: do not fail or contain nodes for audited segments that are not piece-hash-verified (#3161)	2019-10-07 16:06:10 -04:00
Maximillian von Briesen	edadf46009	satellite/audit: delete nodes from containment when segment has changed (#3115)	2019-09-29 04:03:15 +02:00
Jeff Wendling	098cbc9c67	all: use pkg/rpc instead of pkg/transport all of the packages and tests work with both grpc and drpc. we'll probably need to do some jenkins pipelines to run the tests with drpc as well. most of the changes are really due to a bit of cleanup of the pkg/transport.Client api into an rpc.Dialer in the spirit of a net.Dialer. now that we don't need observers, we can pass around stateless configuration to everything rather than stateful things that issue observations. it also adds a DialAddressID for the case where we don't have a pb.Node, but we do have an address and want to assert some ID. this happened pretty frequently, and now there's no more weird contortions creating custom tls options, etc. a lot of the other changes are being consistent/using the abstractions in the rpc package to do rpc style things like finding peer information, or checking status codes. Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412	2019-09-25 15:37:06 -06:00
Maximillian von Briesen	a4048fd529	satellite/audit: fix containment mode (#3085) * add test to make sure we will reverify the share in the containment db rather than in the pointer passed into reverify * use pending audit information only when running reverify	2019-09-19 01:45:15 +02:00
Natalie Villasana	aa3567187e	satellite/audit: worker now verifies and reverifies (#2965)	2019-09-11 18:37:01 -04:00
Egon Elbre	a801fab66a	all: add archview annotations (#2964)	2019-09-10 16:24:16 +03:00
Yingrong Zhao	8eda360ad3	add segment path into logs (#2898)	2019-08-29 08:38:26 -04:00
Egon Elbre	2d69d47655	all: fix Error.New formatting (#2840)	2019-08-21 19:30:29 +03:00
Ivan Fraixedes	87f3b6c708	satellite/audit: Improve comments in verifier (#2829) Improve some source code comments in the verifier.	2019-08-20 10:23:14 -04:00
Egon Elbre	c8edeb0257	satellite/overlay: rename overlay.Cache to overlay.Service (#2717)	2019-08-06 19:35:59 +03:00
Egon Elbre	5d0816430f	rename all the things (#2531) * rename pkg/linksharing to linksharing * rename pkg/httpserver to linksharing/httpserver * rename pkg/eestream to uplink/eestream * rename pkg/stream to uplink/stream * rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo * rename pkg/auth/signing to pkg/signing * rename pkg/storage to uplink/storage * rename pkg/accounting to satellite/accounting * rename pkg/audit to satellite/audit * rename pkg/certdb to satellite/certdb * rename pkg/discovery to satellite/discovery * rename pkg/overlay to satellite/overlay * rename pkg/datarepair to satellite/repair	2019-07-28 08:55:36 +03:00

44 Commits