# Satellite Side Async Packing

## Abstract

This blueprint describes a way to pack small objects without backwards-incompatible
changes on the uplink side; it should be relatively easy to implement.

The rough idea is to have a background process that packs small object segments
together into a single segment. The packed objects will then all refer to that
single segment.

## Background

Small files have several disadvantages in the current system:

- Every small file requires one object and one segment.
- Every small piece on a storage node uses more disk space than needed.
- The metabase segment loop has many entries to iterate over.
- There is a significant connection overhead for small files. _This design does not improve this._

## Design

In the metabase, multiple objects will refer to a single stream (segment) using
sub-ranges. This implies that the segments table now needs to contain a
reference count, among other things.
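
For illustration only, the relationship could be sketched as below; the type and
field names are hypothetical and not the final metabase schema.

```go
// Hypothetical sketch of the object-to-segment relationship after packing.
// Many small objects point at the same packed stream, each with its own
// sub-range; the shared segment tracks how many objects still reference it.
type packedSegment struct {
	StreamID       [16]byte // shared stream / segment id
	ReferenceCount int64    // objects still referencing this segment
	GarbageBytes   int64    // bytes belonging to already-deleted objects
}

type packedObject struct {
	StreamID     [16]byte // points at the shared packed segment
	StreamOffset int64    // where this object's data starts within the stream
	StreamLength int64    // how many bytes of the stream belong to this object
}
```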

A packing background process will:

1. query the objects table one bucket at a time,
   to discover small objects that haven't been packed together;
2. download the encrypted object data into memory;
3. rebuild pieces as necessary;
4. concatenate the pieces together, keeping track of where each original segment
   is located using `stream_offset` and `stream_length`;
5. upload the packed segment to a set of storage nodes;
6. replace the stream references in the objects table with the packed segment.

The satellite, on a download request, will add the `stream_offset` and
`stream_length` to the `OrderLimit`. The uplink doesn't need to be aware of this
change. The storage node, when it receives such an order limit, will read only
the corresponding subrange of the stored piece.
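
As a rough sketch of the satellite side, assuming hypothetical `StreamOffset`
and `StreamLength` fields on the order limit (the real `pb.OrderLimit` change is
not shown here):

```go
// orderLimitSketch stands in for pb.OrderLimit with the two proposed fields;
// all existing fields (piece id, serial number, limits, ...) are elided.
type orderLimitSketch struct {
	StreamOffset int64 // start of the object's data within the packed piece
	StreamLength int64 // number of bytes that belong to the object
}

// When building a download order limit for a packed object, the satellite
// copies the sub-range from the metabase record into the signed limit.
func applyPackedRange(limit *orderLimitSketch, streamOffset, streamLength int64) {
	if streamLength == 0 {
		return // not a packed object; leave the limit as-is
	}
	limit.StreamOffset = streamOffset
	limit.StreamLength = streamLength
}
```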

Note that we should not delete the old segments at step 5, otherwise we might
delete a segment that is being actively downloaded. We need to preserve old
unpacked segments for at least the order limit validity duration (currently
48h). This could be achieved by updating segment TTL-s on the storage nodes.

Repair, audit, and project accounting may need adjustments. Storage node
accounting and GC should work as is.

We also need a process to handle pack fragmentation. For example, when 99% of a
packed segment is "deleted", we should repack it. One option is to add a process
that iterates over packed segments and looks at the `garbage_bytes`; however,
there is no easy way to mark the objects as needing repacking.

The reference counting, of course, would add overhead to all deletes.

## Design - Uplink

It would be possible to implement an additional API in uplink that uploads
packed segments from the start. A very rough implementation could look like:

```go
pack := uplink.BeginPack()
for _, file := range files {
	err := pack.Upload(file.Name, ..., file.Data)
	if err != nil {
		return err
	}
}
return pack.Commit()
```

This is not the final design; it will be revisited when implementing the API.

The main problem with this approach is how to handle crossing segment
boundaries, since we don't want a 2-byte object to be stored in two different
segments.

The easiest way to avoid the boundary issue is to force the user to specify the
size upfront, e.g. by taking a `[]byte` as an argument or by taking the size as
an argument when starting the upload.
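
For example, the size-upfront variant could be exposed roughly like this; the
method names are hypothetical and not a committed uplink API:

```go
// Hypothetical sketch of a size-upfront pack API; these methods do not exist
// in uplink today.
type packSketch interface {
	// UploadBytes takes the whole object at once, so its size is known and the
	// object can be placed entirely within one packed segment.
	UploadBytes(ctx context.Context, key string, data []byte) error

	// BeginUpload declares the size upfront and returns a writer that must
	// receive exactly size bytes before the pack is committed.
	BeginUpload(ctx context.Context, key string, size int64) (io.WriteCloser, error)
}
```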

The alternative is to over-allocate bandwidth and set a limit for the split
position, e.g. when the packed segment exceeds 32 MiB, break whenever the
current object is finished. When the object itself is actually large, split
the segment as usual.
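
A rough sketch of that rule, treating the 32 MiB threshold and the small-object
cutoff as placeholders to be tuned:

```go
// Sketch of the over-allocation rule: keep appending whole objects to the
// current pack and cut it at the next object boundary once the soft limit is
// crossed. Genuinely large objects bypass packing and are segmented as usual.
const (
	packSoftLimit    = 32 << 20 // ~32 MiB: stop adding objects past this point
	smallObjectLimit = 1 << 20  // only objects up to ~1 MiB are packed at all
)

func planPacks(objectSizes []int64) (packs [][]int64) {
	var current []int64
	var currentSize int64
	for _, size := range objectSizes {
		if size > smallObjectLimit {
			continue // uploaded through the regular segmentation path
		}
		current = append(current, size)
		currentSize += size
		if currentSize >= packSoftLimit {
			packs = append(packs, current)
			current, currentSize = nil, 0
		}
	}
	if len(current) > 0 {
		packs = append(packs, current)
	}
	return packs
}
```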

## Rationale

It's possible to implement the same packing across all buckets and projects;
however, this would cause significant issues with deleting whole buckets and
projects. By keeping packs within a single bucket, we can still directly delete
the segments when deleting a bucket or project.

This satellite-side packing does have an ingress and egress cost; however, the
long-term savings in satellite-side storage should outweigh it.

## Implementation

### Storage Node

Add `stream_offset` and `stream_length` to `pb.OrderLimit`. The storage node
should respect these values, and the uplink should not need to treat such order
limits separately.
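
A minimal sketch of the storage node side, assuming the retrieval code can remap
the requested range before reading from the stored piece (names are
placeholders, not the existing piecestore API):

```go
// packedReadRange restricts a piece read to the sub-range named in the order
// limit. reqOffset/reqLength come from the download request; streamOffset and
// streamLength come from the (hypothetical) new order limit fields.
func packedReadRange(streamOffset, streamLength, reqOffset, reqLength int64) (offset, length int64, err error) {
	if streamLength == 0 {
		// Not a packed piece: serve the request unchanged.
		return reqOffset, reqLength, nil
	}
	if reqOffset < 0 || reqLength < 0 || reqOffset+reqLength > streamLength {
		return 0, 0, errors.New("requested range exceeds the packed sub-range")
	}
	// Shift the request so it stays inside the object's slice of the piece.
	return streamOffset + reqOffset, reqLength, nil
}
```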

### Satellite

#### Pack upload selection

For uploading packed segments we should only upload to storage nodes
that support the `stream_offset` feature.

#### Metabase changes

Update the metabase tables:

- add `stream_offset`, `stream_length` to the `objects` table, to track the
  location in the stream / segment;
- add `stream_encrypted_key_nonce`, `stream_encrypted_key` etc. to the `objects`
  table, to track the information necessary for decryption;
- add `reference_count` to the `segments` table, to track how many objects are
  still referencing a particular segment;
- add `garbage_bytes` to the `segments` table, to track how fragmented a given
  packed segment is.

Object deletes need to update `reference_count` and `garbage_bytes`.
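
A hedged sketch of what the schema additions and the delete-time bookkeeping
could look like, written against `database/sql`; the exact DDL, the keying of
the update, and the migration mechanism are assumptions:

```go
// Sketch only: illustrative DDL mirroring the column list above; the real
// metabase migration will look different.
func addPackingColumns(ctx context.Context, db *sql.DB) error {
	ddl := []string{
		`ALTER TABLE objects
			ADD COLUMN stream_offset              INT8,
			ADD COLUMN stream_length              INT8,
			ADD COLUMN stream_encrypted_key_nonce BYTEA,
			ADD COLUMN stream_encrypted_key       BYTEA`,
		`ALTER TABLE segments
			ADD COLUMN reference_count INT8 NOT NULL DEFAULT 1,
			ADD COLUMN garbage_bytes   INT8 NOT NULL DEFAULT 0`,
	}
	for _, stmt := range ddl {
		if _, err := db.ExecContext(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}

// When a packed object is deleted, drop one reference and account its bytes as
// garbage so the re-packing process can later find fragmented segments.
func releasePackedObject(ctx context.Context, db *sql.DB, streamID []byte, streamLength int64) error {
	_, err := db.ExecContext(ctx, `
		UPDATE segments
		SET reference_count = reference_count - 1,
		    garbage_bytes   = garbage_bytes + $2
		WHERE stream_id = $1`,
		streamID, streamLength)
	return err
}
```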

The `stream_encrypted_key` etc. could also be stored in the segments table as a
separate field, or even interleaved in the segment data itself. The appropriate
location should be determined by testing.

A new API will be needed to:

- find a batch of objects to pack,
- replace the objects with a pack.

The replacement should assume that replacing the objects with the pack may
partially fail due to concurrent updates or deletes. The packing should still
succeed when most of the replacements succeed.
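
The new metabase calls could take roughly the following shape; the interface,
type, and method names are assumptions and the real signatures will differ:

```go
// Placeholder types; the real metabase already has richer equivalents.
type (
	bucketLocationSketch    struct{ ProjectID [16]byte; BucketName string }
	objectToPackSketch      struct{ /* object key, version, stream id, size, ... */ }
	packedSegmentInfoSketch struct{ /* new stream id, pieces, per-object ranges, ... */ }
)

// Sketch of the new metabase API for the packing process.
type packingMetabase interface {
	// FindObjectsToPack returns a batch of small, committed, not-yet-packed
	// objects from a single bucket.
	FindObjectsToPack(ctx context.Context, bucket bucketLocationSketch, limit int) ([]objectToPackSketch, error)

	// ReplaceObjectsWithPack points the given objects at the new packed
	// segment. Objects that were concurrently modified or deleted are skipped;
	// the call reports how many replacements actually happened.
	ReplaceObjectsWithPack(ctx context.Context, pack packedSegmentInfoSketch, objects []objectToPackSketch) (replaced int, err error)
}
```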

We need to take care that we don't break downloads while replacing the streams
in objects. Currently, `stream_id` is being used for different identity purposes.

It's quite likely we'll eventually hit a situation where:

1. uplink starts downloading object X, with piece id Y;
2. satellite repacks X into a new segment;
3. satellite sends deletes to storage nodes for piece id Y;
4. uplink fails to download object X.

The satellite needs to take care that piece id Y is stored at least until the
downloading "token" expires.

#### Packing Process

The packing process will need to (a rough sketch of this loop follows the list):

1. query the objects table one bucket at a time,
   to discover small committed objects that haven't been packed together;
2. download the encrypted object data into memory;
3. rebuild pieces as necessary;
4. concatenate the pieces together, keeping track of where each original segment
   is located using `stream_offset` and `stream_length`;
5. upload the packed segment to a set of storage nodes;
6. replace the stream references in the objects table with the packed segment.
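
A rough sketch of that loop, reusing the placeholder types from the earlier
sketches; every helper here is a stand-in for functionality described in this
document, not an existing function:

```go
// Placeholders for the steps above, typed so the loop below reads cleanly.
var (
	findObjectsToPack      func(ctx context.Context, bucket bucketLocationSketch) ([]objectToPackSketch, error)
	downloadEncrypted      func(ctx context.Context, obj objectToPackSketch) ([]byte, error)
	uploadPackedSegment    func(ctx context.Context, data []byte) (packedSegmentInfoSketch, error)
	replaceObjectsWithPack func(ctx context.Context, pack packedSegmentInfoSketch, ranges []packedObject) (int, error)
)

func packBucketOnce(ctx context.Context, bucket bucketLocationSketch) error {
	objects, err := findObjectsToPack(ctx, bucket) // step 1
	if err != nil || len(objects) == 0 {
		return err
	}

	var packed []byte
	var ranges []packedObject
	for _, obj := range objects {
		data, err := downloadEncrypted(ctx, obj) // steps 2 and 3
		if err != nil {
			continue // skip objects that are gone or currently unavailable
		}
		ranges = append(ranges, packedObject{
			StreamOffset: int64(len(packed)), // step 4: remember the sub-range
			StreamLength: int64(len(data)),
		})
		packed = append(packed, data...)
	}
	if len(ranges) == 0 {
		return nil // nothing usable to pack this round
	}

	pack, err := uploadPackedSegment(ctx, packed) // step 5
	if err != nil {
		return err
	}
	_, err = replaceObjectsWithPack(ctx, pack, ranges) // step 6
	return err
}
```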

For choosing `expires_at` in the segments table we can use the maximum of the
packed segments' values. Each object will still track its own `expires_at` date,
and during zombie deletion the `garbage_bytes` can be updated on the segment as
necessary.

When a segment has an `encrypted_etag`, we can handle it in a few possible ways.
First, we can simply ignore them, since the packing only applies to small
objects, which usually wouldn't be uploaded via multipart upload. Secondly,
since the etag-s are only relevant during multipart upload, we could drop them
once we commit (assuming S3 doesn't need them later).

The object packing should take into account (one possible grouping heuristic is
sketched below):

- total size of the pack - choosing too few objects means the packing is less
  effective;
- `expires_at` time - choosing similar values means the segment needs to hold
  less garbage;
- encrypted object key - objects with the same prefix are more likely to be
  deleted together;
- upload time - things uploaded at similar times are more likely to be deleted
  at similar times.
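
One possible grouping heuristic, purely as an illustration of the criteria
above; the prefix handling and the time bucketing are assumptions:

```go
// packGroupKey groups pack candidates that are likely to be deleted together:
// same bucket, same key prefix, similar expiry, and similar upload time.
type packGroupKey struct {
	bucket      string
	keyPrefix   string // first path component of the encrypted object key
	expiresWeek int64  // expires_at truncated to a week; 0 when it never expires
	uploadDay   int64  // created_at truncated to a day
}

func groupKey(bucket, encryptedKey string, expiresAt, createdAt time.Time) packGroupKey {
	key := packGroupKey{
		bucket:    bucket,
		keyPrefix: strings.SplitN(encryptedKey, "/", 2)[0],
		uploadDay: createdAt.Unix() / (24 * 60 * 60),
	}
	if !expiresAt.IsZero() {
		key.expiresWeek = expiresAt.Unix() / (7 * 24 * 60 * 60)
	}
	return key
}
```

Candidates that share a key can then be packed together until the pack reaches
the target size.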

None of these are critical for the first implementation; however, we should try
to figure out good packing heuristics to reduce fragmentation.

It would be possible to avoid satellite-side bandwidth overhead by letting
storage nodes send the pieces to each other and constructing the packed
segment / piece that way. However, this requires a significantly more
complicated protocol, as we've seen with graceful exit.

Note that the object packing creates a strong long-tail cancellation and
locality preference towards the repacking or repair node. Hence, the repacking
and repair nodes need to be more lenient with regard to long-tail cancellation.

#### Re-packing Process

The re-packing process will need to mark `objects` as needing repacking.

By inspecting the `garbage_bytes` and `encrypted_size` in segments, it's
possible to decide whether a segment should be repacked.
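
A minimal sketch of that decision, with the threshold as a placeholder to be
tuned or made configurable:

```go
// needsRepacking reports whether a packed segment is fragmented enough to be
// rewritten; the 75% garbage ratio is a placeholder value.
func needsRepacking(garbageBytes, encryptedSize int64) bool {
	if encryptedSize <= 0 {
		return false
	}
	return float64(garbageBytes)/float64(encryptedSize) >= 0.75
}
```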

TODO: figure out how to mark objects as needing repacking.

#### Service updates

TODO: review how services need to be updated.

## Alternative Solutions

### Storagenode Cohorts

One alternative approach would be to keep an "appendable segment" and, during
upload, choose a segment to append to. This avoids the need for a background
process.

To implement this, we would need some sort of in-memory index to choose an
appropriate appendable segment, based on the aspects mentioned in "Packing
Process". This selection is not complicated to implement.

However, the failure scenarios are much more difficult to handle correctly.

First, the storage node now needs to have a notion of a "segment being
modified". This modification can happen during ongoing downloads. Similarly,
flushing data to [disk is non-trivial](https://www.sqlite.org/atomiccommit.html).
The current model of "write, flush, commit" without modification helps to avoid
many such problems.

The satellite then needs to take into account storage nodes failing. A failure
could be due to a network issue, a node restart, or long-tail cancellation.
When the uplink receives such a failure during upload, the whole "packed
segment" accumulated so far would need to be discarded on that storage node.
When the uploads are spaced further apart in time, it becomes more likely that
during each "append" some node fails. Tracking additional information for
partial uploads would defeat the purpose of packing. In other words, the
failures to upload to a cohort multiply over time. This means a cohort can
probably only be appended to some maximum number of times before it would need
to be repaired.

The appendable segments will also need to be audited and repaired, complicating
the situation further.

## Wrapup

[Who will archive the blueprint when completed? What documentation needs to be
updated to preserve the relevant information from the blueprint?]

## Open issues

- Marking objects for repacking.
- Concurrency issues between upload and packing.
- Services that need updating.
- Calculations and simulation of the ingress/egress cost.