docs/blueprints: add async packing

Describe a process for packing small files. Change-Id: I9f6ad9a85ccf0a92705a47e2ff8efa0072e65230
2021-12-15 13:36:06 +02:00 · 2021-12-15 13:36:06 +02:00 · 0513d4cf3a
commit 0513d4cf3a
parent 847ddaaab0
1 changed files with 261 additions and 0 deletions
--- a/docs/blueprints/satellite-side-async-packing.md
+++ b/docs/blueprints/satellite-side-async-packing.md
@ -0,0 +1,261 @@
+# Satellite Side Async Packing
+
+## Abstract
+
+This blueprint describes a way to pack small objects without backwards
+incompatible changes on the uplink side and should be relatively easy
+to implement.
+
+The rough idea is to have a background process that packs together
+small object segments into a single segment. Then the objects will
+refer to a single segment.
+
+## Background
+
+Small files have several disadvantages in the current system.
+
+- Every small file requires one object and one segment.
+- Every small piece on storage node uses more disk than needed.
+- Metabase segment loop has a lot of things to iterate.
+- There is a significant connection overhead for small files. _This design does not improve this._
+
+## Design
+
+In metabase database, multiple objects will refer to a single stream (segment),
+with sub-ranges. This implies that the segments table will now need to contain
+a reference count among other things.
+
+A packing background process will:
+
+1. query the objects table one bucket at a time,
+   to discover small objects that haven't been packed together;
+2. download encrypted object data into memory;
+3. rebuild pieces as necessary;
+4. concatenate the pieces together, keeping track where each segment is located
+   with `stream_offset` and `stream_length`.
+5. upload the packed segment to a set of storagenodes
+6. replace the stream references in the objects table, with the packed segments
+
+The satellite, on a download request, will add the `stream_offset` and
+`stream_length` to the `OrderLimit`. The uplink doesn't need to be aware of this
+change. The storagenode when getting such an order limit, will
+appropriately only read a subrange from the stored piece.
+
+Note, we should not delete the old segments at step 5., otherwise we might
+delete a segment that is being actively downloaded. We need to preserve old
+unpacked segments at least the duration of order limit validity duration
+(currently 48h). This could be achieved by updating segment TTL-s on the
+storagenode.
+
+Repair, audit, project accounting may need adjustments. Storage node accounting
+and GC should work as is.
+
+We also need a process to handle pack fragmentation. For example when 99% of
+packed segment is "deleted", then we should repack. One option to implement this
+is to add a process that iterates packed segments and looks at the
+"garbage_bytes" -- however there's no easy way to mark the "objects"
+as needing repacking.
+
+The reference counting, of course, would add overhead to all deletes.
+
+## Design - Uplink
+
+It would be possible to implement additional API to uplink that uploads packed
+segments from the start. A very rough implementation could look like:
+
+```
+pack := uplink.BeginPack()
+for _, file := range files {
+	err := pack.Upload(file.Name, ..., file.Data)
+	if err != nil {
+		return err
+	}
+}
+pack.Commit()
+```
+
+This is not the final design, it will be revisited when implementing the API.
+
+The main problem with this approach is how to handle going across segment
+boundaries, since we don't want a `2 byte` object to be stored on two different
+segments.
+
+The easiest way to avoid boundary issue is to force the user to
+specify the size upfront. e.g. taking []byte as argument or taking a size
+as an argument for starting the upload.
+
+The alternative is to over-allocate bandwidth and set a limit for split
+position. e.g. when packed segment is over 32MiB then break whenever the
+current object is finished. When the object is actually large, then split
+the segment as usual.
+
+## Rationale
+
+It's possible to implement the same packing across all buckets and projects,
+however this would have signifcant issues with deleting whole buckets and
+projects. When deleting a bucket or project we would be able to directly
+delete the segments.
+
+This satellite side packing does have an ingress and egress cost, however it
+should outweigh the long-term satellite-side storage cost.
+
+## Implementation
+
+### Storage Node
+
+Add `stream_offset` and `segment_length` to `pb.OrderLimit`. The storage node
+should respect these values, and uplink should not need to treat such order
+limits separately.
+
+### Satellite
+
+#### Pack upload selection
+
+For uploading packed segments we should only upload to storage nodes
+that support the `stream_offset` feature.
+
+#### Metabase changes
+
+Update metabase tables:
+
+- add `stream_offset`, `stream_length` to `objects` table, to track the
+  location in the stream / segment;
+- add `stream_encrypted_key_nonce`, `stream_encrypted_key` etc. to `objects` table,
+  to track necessary information for decryption;
+- add `reference_count` to `segments` table, to track how many objects are
+  still referencing a particular segment
+- add `garbage_bytes` to `segments` table, to track how fragmented a given
+  packed segment is.
+
+Object deletes need to update `reference_count` and `garbage_count`.
+
+The `stream_encrypted_key` etc. could also be stored either in the segments
+table as a separate field. Or even interleaved in the segments themselves.
+The appropriate location should be tested.
+
+New API will be needed to:
+
+- find a batch of objects to pack,
+- replace the objects with a pack.
+
+The replacing of objects should assume that the replacing the pack partially
+may fail due to concurrent updates or deletes. The packing should still succeed
+when most of the replacements succeed.
+
+We need to take care that we don't break dowloads while replacing the streams
+in objects. Currently, stream_id is being used for different identity purposes.
+
+It's quite likely we'll eventually hit a situation where:
+
+1. uplink starts downloading object X, with piece id Y
+2. satellite repacks X to a new segment
+3. satellite sends deletes to storagenodes with piece id Y
+4. uplink fails to download object X
+
+The satellite needs to take care that the piece id Y is stored at least
+until the downloading "token" expires.
+
+#### Packing Process
+
+The packing process will need to:
+
+1. query the objects table one bucket at a time,
+  to discover small committed objects that haven't been packed together;
+2. download encrypted object data into memory;
+3. rebuild pieces as necessary;
+4. concatenate the pieces together, keeping track where each segment is located
+  with `stream_offset` and `stream_length`.
+5. upload the packed segment to a set of storagenodes
+6. replace the stream references in the objects table, with the packed segments
+
+For choosing `expires_at` in segments table we can use the maximum of the given
+segments. The object will still track the appropriate `expires_at` date and
+during zombie deletion the `garbage_bytes` can be updated on the segment as
+necessary.
+
+When a segment has an `encrypted_etag` we can handle these in a few possible
+ways. First, we can ignore them; since the packing only applies to small
+objects, which usually wouldn't be uploaded via multipart upload. Secondly,
+since the etag-s are only relevant for multipart upload, we could drop them
+once we commit (assuming S3 doesn't need them later).
+
+The object packing should take into account:
+
+- total size of the pack - choosing too few objects means the packing
+  is less effective,
+- expires_at time - choosing similar values means the segment needs to
+  hold less garbage,
+- encrypted object key - choosing objects with the same prefix are more
+  likely to be deleted together,
+- upload time - things uploaded at similar times are more likely to
+  be deleted at similar times.
+
+None of these are critical for the first implementation, however, we
+should try to figure out good heuristics for packing to reduce fragmentation.
+
+It would be possible to avoid satellite-side bandwidth overhead by letting
+storagenodes send the pieces to each other and constructing the packed
+segment / piece that way. This however has significantly more complicated
+protocol, as we've seen from graceful exit.
+
+Note, the object packing creates a strong long-tail cancellation and locality
+preference towards the repacking or repair node. Hence, the repacking and repair
+nodes need to be more lenient with regards to long-tail cancellation.
+
+#### Re-packing Process
+
+The re-packing process will need to mark `objects` as needing repacking.
+
+By inspecting the `garbage_bytes` and `encrypted_size` in segments, it's
+possible to decide whether a segment should be repacked.
+
+TODO: figure out how to mark objects as needing repacking.
+
+#### Service updates
+
+TODO: review how services need to be updated.
+
+## Alternative Solutions
+
+### Storagenode Cohorts
+
+One alternative approach would be to keep a "appendable segment" and during
+upload choose a segment to append to. This avoids the need for a background process.
+
+To implement, it would need some sort of in-memory index to choose appropriate
+appendable segment, based on the aspects mentioned in "Packing Process". This
+appropriate selection is not complicated to implement.
+
+However, the failure scenarios are much more difficult to handle correctly.
+
+First the storagenode now needs to have a notion of "segment being modified".
+This modification can happen during ongoing downloads. Similarly, flushing
+data to a [disk is non-trivial](https://www.sqlite.org/atomiccommit.html).
+The current model of "write, flush, commit" without modification helps to
+avoid many such problems.
+
+The satellite then needs to take into account storage nodes failing. The
+failure could be due to network issue, node restart, long tail cancellation.
+When the uplink receives such failure during upload, the whole "packed
+segment" so far would need to be discarded on that storage node. When the
+uploads are spaced more apart in time, there's more likely that during
+each "append" one node fails. Tracking additional information for partial
+uploads would defeat the purpose of packing. Or in other words -- the failure
+to upload to a cohort multiplies over time. This means you probably can
+upload to a cohort some maximum number of times, before it would need to be
+repaired.
+
+The appendable segments also will need to be audited and repaired, complicating
+the situation further.
+
+## Wrapup
+
+[Who will archive the blueprint when completed? What documentation needs to be
+updated to preserve the relevant information from the blueprint?]
+
+## Open issues
+
+- Marking objects for repacking
+- Concurrency issue with upload and packing
+- Services that need updating.
+- Calculations and simulation of the ingress/egress cost.