more trustworthy downtime tracking
Detection chore: Do not update downtime at all from the detection chore.
We only want to include downtime between two explicitly failed ping attempts
(the duration between last contact success and the first failed ping is no longer
included in downtime calculation)
Estimation chore: If the satellite started after the last failed ping for a node,
do not include offline time since the last failed ping time - only
estimate based on two failed pings with no satellite downtime in
between.
This protects us from including satellite downtime in our storagenode downtime calculations.
Change-Id: I1fddc9f7255a7023e02474255d70c64faae75b8a
Sometimes the upload that is supposed to fail due to excess usage
would pass. This looks to be because it's overwriting another object
uploaded earlier in the test and deleting the old pointer. If tally
happened to run after the pointer is deleted but before the current
upload reaches the live accounting check, it might pass through.
The solution is to upload to a different path each time.
Change-Id: Ie6c825b9c6eab9ed53426ae262e7997bcb6beb7f
When the system restarts, the local DNS resolver may not be ready before our
application starts up. Adding a dependency on the DNS service helps prevent
"DNS lookup not available" errors for the storagenode on system reboot.
Change-Id: Ie4be2813736e377df551fd8190f2247d3ae05ccd
In the methods we use to retrieve a user's chargeable BW, we were summing GET, GET_AUDIT,
and GET_REPAIR. We only want to charge for GET.
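As an illustration only, the intended rule can be sketched in memory as follows. The real methods most likely do this in a database query; the `Action` and `Rollup` types here are hypothetical stand-ins.

```go
package main

// Action is a hypothetical stand-in for the bandwidth action type.
type Action int

const (
	Get Action = iota
	GetAudit
	GetRepair
)

// Rollup is a hypothetical bandwidth record: an action and the bytes it used.
type Rollup struct {
	Action Action
	Bytes  int64
}

// chargeableBandwidth sums only GET traffic. GET_AUDIT and GET_REPAIR are
// initiated by the satellite, so they are excluded from the user's charge.
func chargeableBandwidth(rollups []Rollup) int64 {
	var total int64
	for _, r := range rollups {
		if r.Action == Get {
			total += r.Bytes
		}
	}
	return total
}
```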
Change-Id: Icead7695494b22c7c835482cf8b1512a980d59f1
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.
graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.
it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.
Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
This test checks that we are actually walking over the pieces when
starting the cache, and that it is returning expected values.
A recent outage was partially caused by the fact that this cache was
accidentally reading itself (via the pieces store, which has the cache
embedded). This test ensures that does not happen, and checks that when
the cache's `Run` method is called, the space used values are read from
disk and accurately update the cache.
Change-Id: I9ec61c4299ed06c90f79b17de3ffdbbb06bc502e
As a workaround it was set to 0 in the previous release. According to the TOC it must now be set to 500GB.
Change-Id: Ia2743d49e86683396958aff51b95df743af4f872
http.FileServer relies on mime types defined in the operating system.
These values may be misconfigured, so a javascript file might
end up being served as "text/plain".
Change-Id: I3c13c8a9ac484bd765a4de0f8253bfe40dde7513
Currently the whole satellite diagram can be quite overwhelming.
This change makes graphs for api, core and repair processes separately.
Change-Id: Iea906f51c3bcc46c71d7c8f6d8964034b317b3b4
it was noticed that if you had a long lived transaction A that
was blocking some other transaction B and A was being aborted
due to retriable errors, then transaction B was never given
priority. this was due to using savepoints to do lightweight
retries.
this behavior was problematic because we had some queries blocked
for over 16 hours, so this commit addresses the issue with two
prongs:
1. bound the amount of time we will retry a transaction
2. create new transactions when a retry is needed
the first ensures that we never wait for 16 hours, and the value
chosen is 10 minutes. that should be long enough for an ample
amount of retries for small queries, and huge queries probably
shouldn't be retried, even if possible: it's preferable to
find a way to make them smaller.
the second ensures that even in the case of retries, queries that
are blocked on the aborted transaction gain priority to run.
between those two changes, the maximum stall time due to retries
should be bounded to around 10 minutes.
Change-Id: Icf898501ef505a89738820a3fae2580988f9f5f4
We are moving PathCipher to encryption.Store, and storj/uplink needs to be
adjusted for that change. The uplink repo also uses libuplink to run its
tests, so we first need to adjust libuplink in storj/storj and later
storj/uplink.
Change-Id: I84f23e6bad18ac139f72c19939dc526f9f46d88b
instead of aborting on the first error, so that we can hit all
satellites and get the best numbers we can
Change-Id: I21d5163884940612d7d39eaf73a6fac07235cd9e
We introduced a bug in v0.31.7, and deploying it would kick out all the
storage nodes that are full. The easy fix is setting the requirement to 0.
That will allow them to still start up even if they are full.
Change-Id: Ie66f369952d929fcfd47f44f6e5e57eea8f51ff6
satellite api during rolling upgrade test
The old api is using the same config file as the new satellite in the
rolling upgrade test, so we need to set it to something different so
that there is no conflict when we spin up a new storj-sim instance while
the old api is running concurrently.
Change-Id: Ia4ec2db4953f36f43275495710992831ad3916a2
The Control Panel allows controlling different chores and services.
Currently this adds control of cycles.
Change-Id: I734f1676b2a0d883b8f5ba937e93c45ac1a9ce21
Separating Server allows Peers to embed the server directly
and provide customizations and hooks into the rest of the services.
Change-Id: Ic1d68740fd494d2f82c1739bd990849c561b912b