storj/scripts/testdata/satellite-config.yaml.lock

602 lines
18 KiB
Plaintext
Raw Normal View History

# admin peer http listening address
# admin.address: ""
# how often to run the reservoir chore
# audit.chore-interval: 24h0m0s
# max number of times to attempt updating a statdb batch
# audit.max-retries-stat-db: 3
# limit above which we consider an audit is failed
# audit.max-reverify-count: 3
# the minimum acceptable bytes that storage nodes can transfer per second to the satellite
# audit.min-bytes-per-second: 128 B
# the minimum duration for downloading a share from storage nodes before timing out
# audit.min-download-timeout: 5m0s
# how often to recheck an empty audit queue
# audit.queue-interval: 1h0m0s
# number of reservoir slots allotted for nodes, currently capped at 3
# audit.slots: 3
# number of workers to run audits on paths
# audit.worker-concurrency: 1
2019-06-04 13:13:31 +01:00
# how frequently checker should check for bad segments
# checker.interval: 30s
# how frequently irrepairable checker should check for lost pieces
# checker.irreparable-interval: 30m0s
# how stale reliable node cache can be
# checker.reliability-cache-staleness: 5m0s
# override value for repair threshold
# checker.repair-override: 0
# percent of held amount disposed to node after leaving withheld
compensation.dispose-percent: 50
# rate for data at rest per GB/hour
compensation.rates.at-rest-gb-hours: "0.00000205"
# rate for audit egress bandwidth per TB
compensation.rates.get-audit-tb: "10"
# rate for repair egress bandwidth per TB
compensation.rates.get-repair-tb: "10"
# rate for egress bandwidth per TB
compensation.rates.get-tb: "20"
# rate for repair ingress bandwidth per TB
compensation.rates.put-repair-tb: "0"
# rate for ingress bandwidth per TB
compensation.rates.put-tb: "0"
# comma separated monthly withheld percentage rates
compensation.withheld-percents: 75,75,75,50,50,50,25,25,25,0,0,0,0,0,0
# url link for account activation redirect
# console.account-activation-redirect-url: ""
# server address of the graphql api gateway and frontend app
# console.address: :10100
# auth token needed for access to registration token creation endpoint
# console.auth-token: ""
2019-05-28 15:32:51 +01:00
# secret used to sign auth tokens
# console.auth-token-secret: ""
# url link to contacts page
# console.contact-info-url: https://forum.storj.io
# external endpoint of the satellite if hosted
# console.external-address: ""
# allow domains to embed the satellite in a frame, space separated
# console.frame-ancestors: tardigrade.io
# url link to let us know page
# console.let-us-know-url: https://storjlabs.atlassian.net/servicedesk/customer/portals
# enable open registration
# console.open-registration-enabled: false
# used to display at web satellite console
# console.satellite-name: Storj
# name of organization which set up satellite
# console.satellite-operator: Storj Labs
# used to initialize segment.io at web satellite console
# console.segment-io-public-key: ""
# used to communicate with web crawlers and other web robots
# console.seo: "User-agent: *\nDisallow: \nDisallow: /cgi-bin/"
# path to static resources
# console.static-dir: ""
# url link to terms and conditions page
# console.terms-and-conditions-url: https://storj.io/storage-sla/
# url link to sign up verification page
# console.verification-page-url: https://tardigrade.io/satellites/verify
Remove Kademlia dependencies from Satellite and Storagenode (#2966) What: cmd/inspector/main.go: removes kad commands internal/testplanet/planet.go: Waits for contact chore to finish satellite/contact/nodesservice.go: creates an empty nodes service implementation satellite/contact/service.go: implements Local and FetchInfo methods & adds external address config value satellite/discovery/service.go: replaces kad.FetchInfo with contact.FetchInfo in Refresh() & removes Discover() satellite/peer.go: sets up contact service and endpoints storagenode/console/service.go: replaces nodeID with contact.Local() storagenode/contact/chore.go: replaces routing table with contact service storagenode/contact/nodesservice.go: creates empty implementation for ping and request info nodes service & implements RequestInfo method storagenode/contact/service.go: creates a service to return the local node and update its own capacity storagenode/monitor/monitor.go: uses contact service in place of routing table storagenode/operator.go: moves operatorconfig from kad into its own setup storagenode/peer.go: sets up contact service, chore, pingstats and endpoints satellite/overlay/config.go: changes NodeSelectionConfig.OnlineWindow default to 4hr to allow for accurate repair selection Removes kademlia setups in: cmd/storagenode/main.go cmd/storj-sim/network.go internal/testplane/planet.go internal/testplanet/satellite.go internal/testplanet/storagenode.go satellite/peer.go scripts/test-sim-backwards.sh scripts/testdata/satellite-config.yaml.lock storagenode/inspector/inspector.go storagenode/peer.go storagenode/storagenodedb/database.go Why: Replacing Kademlia Please describe the tests: • internal/testplanet/planet_test.go: TestBasic: assert that the storagenode can check in with the satellite without any errors TestContact: test that all nodes get inserted into both satellites' overlay cache during testplanet setup • satellite/contact/contact_test.go: TestFetchInfo: Tests that the FetchInfo method returns the correct info • storagenode/contact/contact_test.go: TestNodeInfoUpdated: tests that the contact chore updates the node information TestRequestInfoEndpoint: tests that the Request info endpoint returns the correct info Please describe the performance impact: Node discovery should be at least slightly more performant since each node connects directly to each satellite and no longer needs to wait for bootstrapping. It probably won't be faster in real time on start up since each node waits a random amount of time (less than 1 hr) to initialize its first connection (jitter).
2019-09-19 20:56:34 +01:00
# the public address of the node, useful for nodes behind NAT
contact.external-address: ""
# timeout for pinging storage nodes
# contact.timeout: 10m0s
# satellite database connection string
# database: postgres://
# satellite database api key lru capacity
# database-options.api-keys-cache.capacity: 1000
# satellite database api key expiration
# database-options.api-keys-cache.expiration: 1m0s
# how often to delete expired serial numbers
# db-cleanup.serials-interval: 24h0m0s
# Maximum Database Connection Lifetime, -1ns means the stdlib default
# db.conn_max_lifetime: -1ns
# Maximum Amount of Idle Database connections, -1 means the stdlib default
# db.max_idle_conns: 20
# Maximum Amount of Open Database connections, -1 means the stdlib default
# db.max_open_conns: 25
# address to listen on for debug endpoints
# debug.addr: 127.0.0.1:0
# expose control panel
# debug.control: false
pkg/process: Now that we are trying to identify the root cause of the satellite load limitations (i.e. currently the satellite has a max ability of 400 rps for uploads and we need this to be higher), we are using the golang diagnostic tools to collect insight into what the bottlenecks are. We currently have a debug endpoint to gather some cpu and mem data, but it could be useful to have continuous profiling. GCP stackdriver has support for continuous profiling so lets set that up and see if it is helpful to gather more data. This PR adds support for [GCP continuous profiler](https://cloud.google.com/profiler) which allows enabling continuous cpu/mem profiling and the stats are sent to stackdriver in google cloud console. To enable the continuous profiling for a storj component, do the following: - prereq: the workload must be running in GKE and have Stackdriver Profiling IAM role permissions - provide the config flag `debug.profilename` in the config.yaml file for the workload (i.e. satellite api process, etc). The profilename should be the workload name, for example "satellite-api". - once the above config flag is provided, the profiler will be initialized and profiling stats will automatically be sent to GCP project where the workload is running and viewable in the Stackdriver Profile page in the console The current implementation assumes the workload is running in GKE, however if we find if useful we can add support to enable this from anywhere. But for simplicity, its configured this way assuming the main goal is to enable in production systems. Change-Id: Ibf8ebe2df7bf06fdd4951ee6a1e48854dd36ad47
2020-02-25 16:46:12 +00:00
# provide the name of the peer to enable continuous cpu/mem profiling for
# debug.profilername: ""
# If set, a path to write a process trace SVG to
# debug.trace-out: ""
# how often to run the downtime detection chore.
# downtime.detection-interval: 1h0m0s
# the downtime estimation chore should check this many offline nodes
# downtime.estimation-batch-size: 1000
# max number of concurrent connections in estimation chore
# downtime.estimation-concurrency-limit: 10
# how often to run the downtime estimation chore
# downtime.estimation-interval: 1h0m0s
# the number of nodes to concurrently send garbage collection bloom filters to
# garbage-collection.concurrent-sends: 1
# set if garbage collection is enabled or not
# garbage-collection.enabled: true
# the false positive rate used for creating a garbage collection bloom filter
# garbage-collection.false-positive-rate: 0.1
# the initial number of pieces expected for a storage node to have, used for creating a filter
# garbage-collection.initial-pieces: 400000
# the time between each send of garbage collection filters to storage nodes
# garbage-collection.interval: 120h0m0s
# the amount of time to allow a node to handle a retain request
# garbage-collection.retain-send-timeout: 1m0s
# if true, run garbage collection as part of the core
# garbage-collection.run-in-core: false
# if true, skip the first run of GC
# garbage-collection.skip-first: true
# size of the buffer used to batch inserts into the transfer queue.
# graceful-exit.chore-batch-size: 500
# how often to run the transfer queue chore.
# graceful-exit.chore-interval: 30s
# whether or not graceful exit is enabled on the satellite side.
# graceful-exit.enabled: true
# size of the buffer used to batch transfer queue reads and sends to the storage node.
# graceful-exit.endpoint-batch-size: 300
# maximum number of transfer failures per piece.
# graceful-exit.max-failures-per-piece: 5
# maximum inactive time frame of transfer activities per node.
# graceful-exit.max-inactive-time-frame: 168h0m0s
# maximum number of order limits a satellite sends to a node before marking piece transfer failed
# graceful-exit.max-order-limit-send-count: 10
# minimum age for a node on the network in order to initiate graceful exit
# graceful-exit.node-min-age-in-months: 6
# maximum percentage of transfer failures per node.
# graceful-exit.overall-max-failures-percentage: 10
# the minimum duration for receiving a stream from a storage node before timing out
# graceful-exit.recv-timeout: 10m0s
# path to the certificate chain for this identity
identity.cert-path: /root/.local/share/storj/identity/satellite/identity.cert
# path to the private key for this identity
identity.key-path: /root/.local/share/storj/identity/satellite/identity.key
# what to use for storing real-time accounting data
satellite/accounting: refactor live accounting to hold current estimated totals live accounting used to be a cache to store writes before they are picked up during the tally iteration, after which the cache is cleared. This created a window in which users could potentially exceed the storage limit. This PR refactors live accounting to hold current estimations of space used per project. This should also reduce DB load since we no longer need to query the satellite DB when checking space used for limiting. The mechanism by which the new live accounting system works is as follows: During the upload of any segment, the size of that segment is added to its respective project total in live accounting. At the beginning of the tally iteration we record the current values in live accounting as `initialLiveTotals`. At the end of the tally iteration we again record the current totals in live accounting as `latestLiveTotals`. The metainfo loop observer in tally allows us to get the project totals from what it observed in metainfo DB which are stored in `tallyProjectTotals`. However, for any particular segment uploaded during the metainfo loop, the observer may or may not have seen it. Thus, we take half of the difference between `latestLiveTotals` and `initialLiveTotals`, and add that to the total that was found during tally and set that as the new live accounting total. Initially, live accounting was storing the total stored amount across all nodes rather than the segment size, which is inconsistent with how we record amounts stored in the project accounting DB, so we have refactored live accounting to record segment size Change-Id: Ie48bfdef453428fcdc180b2d781a69d58fd927fb
2019-10-31 17:27:38 +00:00
# live-accounting.storage-backend: ""
# if true, log function filename and line number
# log.caller: false
# if true, set logging to development mode
# log.development: false
# configures log encoding. can either be 'console' or 'json'
# log.encoding: console
# the minimum log level to log
# log.level: info
# can be stdout, stderr, or a filename
# log.output: stderr
# if true, log stack traces
# log.stack: false
# smtp authentication type
# mail.auth-type: login
# oauth2 app's client id
# mail.client-id: ""
# oauth2 app's client secret
# mail.client-secret: ""
# sender email address
# mail.from: ""
# plain/login auth user login
# mail.login: ""
# plain/login auth user password
# mail.password: ""
# refresh token used to retrieve new access token
# mail.refresh-token: ""
# smtp server address
# mail.smtp-server-address: ""
# path to email templates source
# mail.template-path: ""
# uri which is used when retrieving new access token
# mail.token-uri: ""
# server address of the marketing Admin GUI
# marketing.address: 127.0.0.1:8090
# base url for marketing Admin GUI
# marketing.base-url: ""
# path to static resources
# marketing.static-dir: ""
# the database connection string to use
# metainfo.database-url: postgres://
# how long to wait for new observers before starting iteration
# metainfo.loop.coalesce-duration: 5s
# metainfo loop rate limit (default is 0 which is unlimited segments per second)
# metainfo.loop.rate-limit: 0
# maximum time allowed to pass between creating and committing a segment
# metainfo.max-commit-interval: 48h0m0s
# maximum inline segment size
# metainfo.max-inline-segment-size: 4.0 KiB
# maximum segment size
# metainfo.max-segment-size: 64.0 MiB
# minimum remote segment size
# metainfo.min-remote-segment-size: 1.2 KiB
# toggle flag if overlay is enabled
# metainfo.overlay: true
# timeout for dialing nodes (0 means satellite default)
# metainfo.piece-deletion.dial-timeout: 0s
# threshold for retrying a failed node
# metainfo.piece-deletion.fail-threshold: 5m0s
# maximum number of concurrent requests to storage nodes
# metainfo.piece-deletion.max-concurrency: 100
# maximum number of pieces per batch
# metainfo.piece-deletion.max-pieces-per-batch: 5000
# maximum number pieces per single request
# metainfo.piece-deletion.max-pieces-per-request: 1000
# timeout for a single delete request
# metainfo.piece-deletion.request-timeout: 1m0s
# number of projects to cache.
# metainfo.rate-limiter.cache-capacity: 10000
# how long to cache the projects limiter.
# metainfo.rate-limiter.cache-expiration: 10m0s
# whether rate limiting is enabled.
# metainfo.rate-limiter.enabled: true
# request rate per project per second.
# metainfo.rate-limiter.rate: 1000
# the size of each new erasure share in bytes
# metainfo.rs.erasure-share-size: 256 B
# maximum buffer memory to be allocated for read buffers
# metainfo.rs.max-buffer-mem: 4.0 MiB
# the largest amount of pieces to encode to. n (upper bound for validation).
# metainfo.rs.max-total-threshold: 130
# the minimum pieces required to recover a segment. k.
# metainfo.rs.min-threshold: 29
# the largest amount of pieces to encode to. n (lower bound for validation).
# metainfo.rs.min-total-threshold: 95
# the minimum safe pieces before a repair is triggered. m.
# metainfo.rs.repair-threshold: 35
# the desired total pieces for a segment. o.
# metainfo.rs.success-threshold: 80
# the largest amount of pieces to encode to. n.
# metainfo.rs.total-threshold: 110
# validate redundancy scheme configuration
# metainfo.rs.validate: true
# address(es) to send telemetry to (comma-separated)
# metrics.addr: collectora.storj.io:9000
# application name for telemetry identification
# metrics.app: satellite
# application suffix
# metrics.app-suffix: -release
# the time between each metrics chore run
# metrics.chore-interval: 15m0s
# instance id prefix
# metrics.instance-prefix: ""
# how frequently to send up telemetry
# metrics.interval: 1m0s
# path to log for oom notices
# monkit.hw.oomlog: /var/log/kern.log
# how long until an order expires
# orders.expiration: 48h0m0s
# how many items in the rollups write cache before they are flushed to the database
# orders.flush-batch-size: 10000
# how often to flush the rollups write cache to the database
# orders.flush-interval: 1m0s
# log the offline/disqualification status of nodes
# orders.node-status-logging: false
# how many records to read in a single transaction when calculating billable bandwidth
# orders.reported-rollups-read-batch-size: 1000
# how many orders to batch per transaction
# orders.settlement-batch-size: 250
# the number of times a node has been audited to not be considered a New Node
2019-07-02 00:02:23 +01:00
# overlay.node.audit-count: 100
# the reputation cut-off for disqualifying SNs based on audit history
# overlay.node.audit-reputation-dq: 0.6
# the forgetting factor used to calculate the audit SNs reputation
# overlay.node.audit-reputation-lambda: 0.95
# weight to apply to audit reputation for total repair reputation calculation
# overlay.node.audit-reputation-repair-weight: 1
# weight to apply to audit reputation for total uplink reputation calculation
# overlay.node.audit-reputation-uplink-weight: 1
# the normalization weight used to calculate the audit SNs reputation
# overlay.node.audit-reputation-weight: 1
# require distinct IPs when choosing nodes for upload
# overlay.node.distinct-ip: true
# how much disk space a node at minimum must have to be selected for upload
# overlay.node.minimum-disk-space: 100.0 MB
# the minimum node software version for node selection queries
# overlay.node.minimum-version: ""
# the fraction of new nodes allowed per request
# overlay.node.new-node-fraction: 0.05
# the amount of time without seeing a node before its considered offline
# overlay.node.online-window: 4h0m0s
# the number of times a node's uptime has been checked to not be considered a New Node
2019-07-02 00:02:23 +01:00
# overlay.node.uptime-count: 100
# number of update requests to process per transaction
# overlay.update-stats-batch-size: 100
# amount of percents that user will earn as bonus credits by depositing in STORJ tokens
# payments.bonus-rate: 10
# duration a new coupon is valid in months/billing cycles
# payments.coupon-duration: 2
# project limit to which increase to after applying the coupon, 0 B means not changing it from the default
# payments.coupon-project-limit: 0 B
# coupon value in cents
# payments.coupon-value: 30
# price user should pay for each TB of egress
# payments.egress-tb-price: "45"
# minimum value of coin payments in cents before coupon is applied
# payments.min-coin-payment: 5000
# price node receive for storing TB of audit in cents
# payments.node-audit-bandwidth-price: 1000
# price node receive for storing disk space in cents/TB
# payments.node-disk-space-price: 150
# price node receive for storing TB of egress in cents
# payments.node-egress-bandwidth-price: 2000
# price node receive for storing TB of repair in cents
# payments.node-repair-bandwidth-price: 1000
# price user should pay for each object stored in network per month
# payments.object-price: "0.0000022"
# payments provider to use
# payments.provider: ""
# price user should pay for storing TB per month
# payments.storage-tb-price: "10"
# amount of time we wait before running next account balance update loop
# payments.stripe-coin-payments.account-balance-update-interval: 1h30m0s
# toogle autoadvance feature for invoice creation
# payments.stripe-coin-payments.auto-advance: false
# coinpayments API private key key
# payments.stripe-coin-payments.coinpayments-private-key: ""
# coinpayments API public key
# payments.stripe-coin-payments.coinpayments-public-key: ""
# amount of time we wait before running next conversion rates update loop
# payments.stripe-coin-payments.conversion-rates-cycle-interval: 10m0s
# stripe API public key
# payments.stripe-coin-payments.stripe-public-key: ""
# stripe API secret key
# payments.stripe-coin-payments.stripe-secret-key: ""
# amount of time we wait before running next transaction update loop
# payments.stripe-coin-payments.transaction-update-interval: 30m0s
# referrals.referral-manager-url: ""
# time limit for downloading pieces from a node for repair
# repairer.download-timeout: 5m0s
# whether to download pieces for repair in memory (true) or download to disk (false)
# repairer.in-memory-repair: false
2019-06-04 13:13:31 +01:00
# how frequently repairer should try and repair more data
# repairer.interval: 5m0s
# maximum buffer memory (in bytes) to be allocated for read buffers
# repairer.max-buffer-mem: 4.0 MB
[V3-1927] Repairer uploads to max threshold instead of success… (#2423) * pkg/datarepair: Add test to check num upload pieces Add a new test for ensuring the number of pieces that the repair process upload when a segment is injured. * satellite/orders: Don't create "put order limits" over total Repair must not create "put order limits" more than the total count. * pkg/datarepair: Update upload repair pieces test Update the test which checks the number of pieces which are uploaded during a repair for using the same excess over the success threshold value than the implementation. * satellites/orders: Limit repair put order for not being total Limit the number of put orders to be used by repair for only uploading pieces to a % excess over the successful threshold. * pkg/datarepair: Change DataRepair test to pass again Make some changes in the DataRepair test to make pass again after the repair upload repaired pieces only until a % excess over success threshold. Also update the steps description of the DataRepair test after it has been changed, to match on what's now, besides to leave it more generic for avoiding having to update it on minimal future refactorings. * satellite: Make repair excess optimal threshold configurable Add a new configuration parameter to the satellite for being able to configure the percentage excess over the optimal threshold, used for determining how many pieces should be repaired/uploaded, rather than having the value hard coded. * repairer: Add configurable param to segments/repairer Add a new parameters to the segment/repairer to calculate the maximum number of excess nodes, based on the optimal threshold, that repaired pieces can be uploaded. This new parameter has been added for not returning more nodes than the number of upload orders for data repair satellite service calculate for repairing pieces. * pkg/storage/ec: Update log message in clien.Repair * satellite: Update configuration lock file
2019-07-11 23:44:47 +01:00
# ratio applied to the optimal threshold to calculate the excess of the maximum number of repaired pieces to upload
# repairer.max-excess-rate-optimal-threshold: 0.05
# maximum segments that can be repaired concurrently
# repairer.max-repair: 5
# time limit for uploading repaired pieces to new storage nodes
# repairer.timeout: 5m0s
# time limit for an entire repair job, from queue pop to upload completion
# repairer.total-timeout: 45m0s
# how often to flush the reported serial rollups to the database
# reported-rollup.interval: 5m0s
# option for deleting tallies after they are rolled up
# rollup.delete-tallies: true
# how frequently rollup should run
# rollup.interval: 24h0m0s
# the bandwidth and storage usage limit for the alpha release
# rollup.max-alpha-usage: 5.0 GB
# public address to listen on
server.address: :7777
# log all GRPC traffic to zap logger
server.debug-log-traffic: false
# if true, client leaves may contain the most recent certificate revocation for the current certificate
# server.extensions.revocation: true
# if true, client leaves must contain a valid "signed certificate extension" (NB: verified against certs in the peer ca whitelist; i.e. if true, a whitelist must be provided)
# server.extensions.whitelist-signed-leaf: false
# path to the CA cert whitelist (peer identities must be signed by one these to be verified). this will override the default peer whitelist
# server.peer-ca-whitelist-path: ""
# identity version(s) the server will be allowed to talk to
# server.peer-id-versions: latest
# private address to listen on
server.private-address: 127.0.0.1:7778
# url for revocation database (e.g. bolt://some.db OR redis://127.0.0.1:6378?db=2&password=abc123)
# server.revocation-dburl: bolt://testdata/revocations.db
# if true, uses peer ca whitelist checking
# server.use-peer-ca-whitelist: true
# how frequently the tally service should run
# tally.interval: 1h0m0s
# address for jaeger agent
# tracing.agent-addr: agent.tracing.datasci.storj.io
# application name for tracing identification
# tracing.app: satellite
# application suffix
# tracing.app-suffix: -release
# buffer size for collector batch packet size
# tracing.buffer-size: 0
# how frequently to send up telemetry
# tracing.interval: 0s
# buffer size for collector queue size
# tracing.queue-size: 0
# how frequently to send up telemetry
# tracing.sample: 0
# Interval to check the version
# version.check-interval: 15m0s
# Request timeout for version checks
# version.request-timeout: 1m0s
# server address to check its version against
# version.server-address: https://version.storj.io