# Satellite Side Async Packing

## Abstract

This blueprint describes a way to pack small objects without
backwards-incompatible changes on the uplink side; it should be
relatively easy to implement.

The rough idea is to have a background process that packs small
object segments together into a single segment. Multiple objects
will then refer to that single segment.

## Background

Small files have several disadvantages in the current system:

- Every small file requires one object and one segment.
- Every small piece on a storage node uses more disk space than needed.
- The metabase segment loop has many more entries to iterate over.
- There is a significant connection overhead for small files. _This design does not improve this._

## Design

In the metabase, multiple objects will refer to a single stream (segment)
with sub-ranges. This implies that the segments table will now need to
contain a reference count, among other things.

A packing background process will:

1. query the objects table one bucket at a time,
   to discover small objects that haven't been packed together;
2. download the encrypted object data into memory;
3. rebuild pieces as necessary;
4. concatenate the pieces together, keeping track of where each segment is
   located with `stream_offset` and `stream_length`;
5. upload the packed segment to a set of storage nodes;
6. replace the stream references in the objects table with the packed segment.

On a download request, the satellite will add the `stream_offset` and
`stream_length` to the `OrderLimit`. The uplink doesn't need to be aware of
this change. A storage node that receives such an order limit will read only
the specified sub-range from the stored piece.
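
A minimal sketch of the storage node side, assuming hypothetical
`StreamOffset` and `StreamLength` fields on `pb.OrderLimit`, with a zero
length meaning "the whole piece":

```
// subrangeReader limits a piece read to the sub-range named in the
// order limit. io.SectionReader does the offset/length bookkeeping.
func subrangeReader(piece io.ReaderAt, pieceSize int64, limit *pb.OrderLimit) io.Reader {
	offset, length := limit.StreamOffset, limit.StreamLength // hypothetical fields
	if length == 0 {
		// No sub-range requested: serve the whole piece.
		offset, length = 0, pieceSize
	}
	return io.NewSectionReader(piece, offset, length)
}
```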

Note, we should not delete the old segments in step 5; otherwise we might
delete a segment that is being actively downloaded. We need to preserve the
old unpacked segments for at least the order limit validity duration
(currently 48h). This could be achieved by updating segment TTLs on the
storage nodes.

Repair, audit, and project accounting may need adjustments. Storage node
accounting and GC should work as is.

We also need a process to handle pack fragmentation. For example, when 99% of
a packed segment has been "deleted", we should repack it. One option is to add
a process that iterates over packed segments and looks at their
`garbage_bytes` -- however, there's no easy way to mark the `objects`
as needing repacking.

The reference counting, of course, would add overhead to all deletes.

## Design - Uplink

It would be possible to add an API to uplink that uploads packed
segments from the start. A very rough implementation could look like:

```
pack := uplink.BeginPack()
for _, file := range files {
	err := pack.Upload(file.Name, ..., file.Data)
	if err != nil {
		return err
	}
}
if err := pack.Commit(); err != nil {
	return err
}
```

This is not the final design; it will be revisited when implementing the API.

The main problem with this approach is how to handle crossing segment
boundaries, since we don't want a 2-byte object to be split across two
different segments.

The easiest way to avoid the boundary issue is to force the user to
specify the size upfront, e.g. by taking a `[]byte` as an argument or by
taking the size as an argument when starting the upload.
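
For illustration, a size-upfront variant of the pack API could look roughly
like this; the names are hypothetical, not a committed design:

```
// Pack is a hypothetical sketch of a size-upfront packing API.
type Pack interface {
	// Upload stores data as one object; the size is known from len(data),
	// so the object is never split across segment boundaries.
	Upload(key string, data []byte) error
	// UploadReader requires the size upfront, letting the pack decide
	// whether the object still fits in the current segment.
	UploadReader(key string, size int64, r io.Reader) error
	// Commit finalizes the packed segment.
	Commit() error
}
```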

The alternative is to over-allocate bandwidth and set a soft limit for the
split position, e.g. once the packed segment exceeds 32MiB, break whenever the
current object finishes. When an object is actually large, split
the segment as usual.

## Rationale

It's possible to implement the same packing across all buckets and projects;
however, this would cause significant issues with deleting whole buckets and
projects. With per-bucket packing, deleting a bucket or project still lets us
delete the segments directly.

This satellite-side packing does have an ingress and egress cost, however the
long-term satellite-side storage savings should outweigh it.

## Implementation

### Storage Node

Add `stream_offset` and `stream_length` to `pb.OrderLimit`. The storage node
should respect these values, and the uplink should not need to treat such
order limits separately.

### Satellite

#### Pack upload selection

Packed segments should only be uploaded to storage nodes
that support the `stream_offset` feature.

#### Metabase changes

Update the metabase tables:

- add `stream_offset` and `stream_length` to the `objects` table, to track the
  object's location in the stream / segment;
- add `stream_encrypted_key_nonce`, `stream_encrypted_key` etc. to the
  `objects` table, to track the information necessary for decryption;
- add `reference_count` to the `segments` table, to track how many objects are
  still referencing a particular segment;
- add `garbage_bytes` to the `segments` table, to track how fragmented a given
  packed segment is.

Object deletes need to update `reference_count` and `garbage_bytes`.
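
A minimal sketch of that delete-side bookkeeping, assuming a SQL metabase
and the hypothetical columns above:

```
// accountObjectDelete updates the packed segment's bookkeeping when one
// of its objects is deleted: one fewer reference, and the object's bytes
// become garbage. (Sketch only; column names follow this blueprint.)
func accountObjectDelete(ctx context.Context, db *sql.DB, streamID []byte, streamLength int64) error {
	_, err := db.ExecContext(ctx, `
		UPDATE segments
		SET reference_count = reference_count - 1,
		    garbage_bytes   = garbage_bytes + $2
		WHERE stream_id = $1
	`, streamID, streamLength)
	return err
}
```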

The `stream_encrypted_key` etc. could also be stored in the segments
table as a separate field, or even interleaved in the segments themselves.
The appropriate location should be determined by testing.

A new API will be needed to:

- find a batch of objects to pack,
- replace the objects with a pack.

Replacing the objects should assume that the replacement may partially fail
due to concurrent updates or deletes. The packing should still succeed
when most of the replacements succeed, as sketched below.
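
A sketch of the tolerant replacement loop; `Metabase`, `ObjectRef`, and
`PackedSegment` are placeholders for whatever the real API ends up being:

```
// replaceWithPack swaps objects over to the packed segment, skipping any
// object that was concurrently updated or deleted. It only fails when
// nothing could be replaced at all.
func replaceWithPack(ctx context.Context, db Metabase, pack PackedSegment, objects []ObjectRef) error {
	replaced := 0
	for _, obj := range objects {
		if err := db.ReplaceObjectStream(ctx, obj, pack); err != nil {
			continue // concurrent update or delete; leave this object unpacked
		}
		replaced++
	}
	if replaced == 0 {
		return errors.New("packing failed: no objects were replaced")
	}
	return nil
}
```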

We need to take care that we don't break downloads while replacing the streams
in objects. Currently, `stream_id` is being used for different identity
purposes. It's quite likely we'll eventually hit a situation where:

1. uplink starts downloading object X, with piece id Y;
2. satellite repacks X to a new segment;
3. satellite sends deletes to storage nodes with piece id Y;
4. uplink fails to download object X.

The satellite needs to take care that the piece id Y is stored at least
until the downloading "token" expires.

#### Packing Process

The packing process will need to (see the sketch after this list):

1. query the objects table one bucket at a time,
   to discover small committed objects that haven't been packed together;
2. download the encrypted object data into memory;
3. rebuild pieces as necessary;
4. concatenate the pieces together, keeping track of where each segment is
   located with `stream_offset` and `stream_length`;
5. upload the packed segment to a set of storage nodes;
6. replace the stream references in the objects table with the packed segment.
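
A minimal sketch of steps 2-6, with all helper names hypothetical (step 3,
rebuilding pieces, is elided behind `downloadEncrypted`):

```
// packObjects concatenates the encrypted data of small objects into one
// packed segment, recording each object's sub-range as it goes.
func packObjects(ctx context.Context, objects []ObjectRef) error {
	var packed bytes.Buffer
	var entries []PackEntry
	for _, obj := range objects {
		data, err := downloadEncrypted(ctx, obj) // steps 2-3
		if err != nil {
			return err
		}
		entries = append(entries, PackEntry{
			Object:       obj,
			StreamOffset: int64(packed.Len()), // step 4: position inside the pack
			StreamLength: int64(len(data)),
		})
		packed.Write(data)
	}
	segment, err := uploadPacked(ctx, packed.Bytes()) // step 5
	if err != nil {
		return err
	}
	return replaceObjectStreams(ctx, segment, entries) // step 6
}
```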

For choosing `expires_at` in the segments table we can use the maximum of the
given segments' values. The object will still track the appropriate
`expires_at` date, and during zombie deletion the `garbage_bytes` can be
updated on the segment as necessary.
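
In code, picking the pack's expiry could be as simple as the following
sketch, which additionally assumes a zero time means "never expires":

```
// packExpiresAt picks the packed segment's expiry: the latest expiry
// among its objects. A zero time (never expires) wins outright.
func packExpiresAt(objects []ObjectRef) time.Time {
	var latest time.Time
	for _, obj := range objects {
		if obj.ExpiresAt.IsZero() {
			return time.Time{} // one object never expires; the pack can't either
		}
		if obj.ExpiresAt.After(latest) {
			latest = obj.ExpiresAt
		}
	}
	return latest
}
```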

When a segment has an `encrypted_etag`, we can handle it in a few possible
ways. First, we can ignore them, since the packing only applies to small
objects, which usually wouldn't be uploaded via multipart upload. Second,
since the etags are only relevant for multipart uploads, we could drop them
once we commit (assuming S3 doesn't need them later).

The object packing should take into account:

- the total size of the pack -- choosing too few objects makes the packing
  less effective;
- the `expires_at` time -- choosing similar values means the segment needs to
  hold less garbage;
- the encrypted object key -- objects with the same prefix are more
  likely to be deleted together;
- the upload time -- things uploaded at similar times are more likely to
  be deleted at similar times.

None of these are critical for the first implementation; however, we
should try to figure out good heuristics for packing to reduce fragmentation.
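
One possible shape for such a heuristic -- entirely illustrative -- is to
group candidates by a key combining these signals and fill packs per group:

```
// packGroup is a hypothetical grouping key: objects in the same bucket,
// with the same encrypted-key prefix, expiring in the same window, are
// more likely to be deleted together and so belong in the same pack.
type packGroup struct {
	Bucket       string
	KeyPrefix    string    // e.g. the first component of the encrypted key
	ExpiryWindow time.Time // expires_at truncated to a day
}

func groupKeyOf(obj ObjectRef) packGroup {
	return packGroup{
		Bucket:       obj.Bucket,
		KeyPrefix:    firstKeyComponent(obj.EncryptedKey), // hypothetical helper
		ExpiryWindow: obj.ExpiresAt.Truncate(24 * time.Hour),
	}
}
```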

It would be possible to avoid the satellite-side bandwidth overhead by letting
storage nodes send the pieces to each other and constructing the packed
segment / piece that way. This, however, requires a significantly more
complicated protocol, as we've seen with graceful exit.

Note, the object packing creates a strong long-tail-cancellation and locality
preference towards the repacking or repair node. Hence, the repacking and
repair nodes need to be more lenient with regards to long-tail cancellation.

#### Re-packing Process

The re-packing process will need to mark `objects` as needing repacking.
By inspecting the `garbage_bytes` and `encrypted_size` in segments, it's
possible to decide whether a segment should be repacked.
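
A minimal sketch of that decision; the 50% threshold is an arbitrary
illustration, not a recommendation:

```
// shouldRepack reports whether a packed segment is fragmented enough to
// be worth repacking, based on the fields tracked in the segments table.
func shouldRepack(garbageBytes, encryptedSize int64) bool {
	if encryptedSize == 0 {
		return false
	}
	return float64(garbageBytes)/float64(encryptedSize) >= 0.5
}
```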

TODO: figure out how to mark objects as needing repacking.

#### Service updates

TODO: review how services need to be updated.

## Alternative Solutions

### Storagenode Cohorts

One alternative approach would be to keep an "appendable segment" and, during
upload, choose a segment to append to. This avoids the need for a background
process.

To implement this, we would need some sort of in-memory index to choose an
appropriate appendable segment, based on the aspects mentioned in "Packing
Process". The selection itself is not complicated to implement.

However, the failure scenarios are much more difficult to handle correctly.
First, the storage node now needs to have a notion of a "segment being
modified". This modification can happen during ongoing downloads. Similarly,
flushing data to [disk is non-trivial](https://www.sqlite.org/atomiccommit.html).
The current model of "write, flush, commit" without modification helps to
avoid many such problems.

The satellite then needs to take into account storage nodes failing. The
failure could be due to a network issue, a node restart, or long-tail
cancellation. When the uplink receives such a failure during upload, the whole
"packed segment" so far would need to be discarded on that storage node. When
the uploads are spaced further apart in time, it becomes more likely that one
node fails during each "append". Tracking additional information for partial
uploads would defeat the purpose of packing. Or, in other words, the failure
probability for a cohort compounds over time. This means you can probably
upload to a cohort only some maximum number of times before it would need to
be repaired.

The appendable segments will also need to be audited and repaired,
complicating the situation further.

## Wrapup

[Who will archive the blueprint when completed? What documentation needs to be
updated to preserve the relevant information from the blueprint?]

## Open issues

- Marking objects for repacking.
- Concurrency issue with upload and packing.
- Services that need updating.
- Calculations and simulation of the ingress/egress cost.