Satellite Side Async Packing
Abstract
This blueprint describes a way to pack small objects without backwards-incompatible changes on the uplink side; it should be relatively easy to implement.
The rough idea is to have a background process that packs small object segments together into a single segment. Multiple objects will then refer to that single segment.
Background
Small files have several disadvantages in the current system.
- Every small file requires one object and one segment.
- Every small piece on a storage node uses more disk space than needed.
- The metabase segment loop has a lot of entries to iterate over.
- There is significant connection overhead for small files; this design does not improve that.
Design
In the metabase database, multiple objects will refer to a single stream (segment), with sub-ranges. This implies that the `segments` table will now need to contain a reference count, among other things.
A packing background process will:
- query the objects table one bucket at a time, to discover small objects that haven't been packed together;
- download encrypted object data into memory;
- rebuild pieces as necessary;
- concatenate the pieces together, keeping track of where each segment is located using `stream_offset` and `stream_length`;
- upload the packed segment to a set of storage nodes;
- replace the stream references in the objects table with the packed segments.
The satellite, on a download request, will add the `stream_offset` and `stream_length` to the `OrderLimit`. The uplink doesn't need to be aware of this change. The storage node, when getting such an order limit, will appropriately read only a sub-range from the stored piece.
Note: we should not delete the old segments at step 5, otherwise we might delete a segment that is being actively downloaded. We need to preserve the old unpacked segments for at least the order limit validity duration (currently 48h). This could be achieved by updating segment TTLs on the storage nodes.
Repair, audit, project accounting may need adjustments. Storage node accounting and GC should work as is.
We also need a process to handle pack fragmentation. For example when 99% of packed segment is "deleted", then we should repack. One option to implement this is to add a process that iterates packed segments and looks at the "garbage_bytes" -- however there's no easy way to mark the "objects" as needing repacking.
The reference counting, of course, would add overhead to all deletes.
Design - Uplink
It would be possible to implement additional API to uplink that uploads packed segments from the start. A very rough implementation could look like:
pack := uplink.BeginPack()
for _, file := range files {
	err := pack.Upload(file.Name, ..., file.Data)
	if err != nil {
		return err
	}
}
pack.Commit()
This is not the final design; it will be revisited when implementing the API.
The main problem with this approach is how to handle crossing segment boundaries, since we don't want a 2-byte object to be stored on two different segments.
The easiest way to avoid the boundary issue is to force the user to specify the size upfront, e.g. by taking a `[]byte` as an argument or by taking a size as an argument when starting the upload.
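As a hedged illustration of the size-upfront option, the sketch below takes the whole object as a `[]byte`, so the packer knows before writing anything whether the object still fits into the current packed segment. All names (`Pack`, `UploadBytes`, `maxPackedSegmentSize`) and the size limit are placeholders, not a proposed API.

```go
package uplinkpack

import "errors"

// maxPackedSegmentSize is a placeholder limit for a single packed segment.
const maxPackedSegmentSize = 64 << 20 // 64 MiB, illustrative only

// Pack is a hypothetical handle for uploading multiple small objects into a
// single packed segment.
type Pack struct {
	used int64
}

// UploadBytes takes the whole object upfront, so the pack knows whether the
// object still fits into the current segment before writing anything.
func (p *Pack) UploadBytes(key string, data []byte) error {
	if int64(len(data)) > maxPackedSegmentSize {
		return errors.New("object too large for packing; upload it as a regular object")
	}
	if p.used+int64(len(data)) > maxPackedSegmentSize {
		// A real implementation would commit the current segment and start
		// a new one; this sketch only signals the boundary.
		return errors.New("current packed segment is full")
	}
	p.used += int64(len(data))
	// ... encrypt and append the data under the given key to the in-progress segment ...
	return nil
}
```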
The alternative is to over-allocate bandwidth and set a limit for the split position, e.g. when the packed segment exceeds 32 MiB, break whenever the current object is finished. When the object is actually large, split the segment as usual.
Rationale
It's possible to implement the same packing across all buckets and projects; however, this would have significant issues with deleting whole buckets and projects. By packing within a single bucket, we can still directly delete the segments when deleting a bucket or project.
This satellite-side packing does have an ingress and egress cost; however, the long-term satellite-side storage savings should outweigh it.
Implementation
Storage Node
Add `stream_offset` and `stream_length` to `pb.OrderLimit`. The storage node should respect these values, and the uplink should not need to treat such order limits separately.
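A minimal sketch of the storage node side, assuming hypothetical `StreamOffset`/`StreamLength` values taken from the order limit (the struct below merely stands in for the extended `pb.OrderLimit`) and a piece reader that supports seeking:

```go
package piecereader

import "io"

// packedOrderLimit stands in for pb.OrderLimit extended with the proposed
// fields; the real message carries many more fields.
type packedOrderLimit struct {
	StreamOffset int64 // where the requested object starts inside the piece
	StreamLength int64 // how many bytes of the piece belong to the object
}

// openSubrange restricts reads to the object's sub-range of the stored piece.
// The piece reader must support seeking (e.g. an *os.File).
func openSubrange(piece io.ReadSeeker, limit packedOrderLimit) (io.Reader, error) {
	if limit.StreamLength == 0 {
		// No sub-range requested: serve the whole piece as before.
		return piece, nil
	}
	if _, err := piece.Seek(limit.StreamOffset, io.SeekStart); err != nil {
		return nil, err
	}
	return io.LimitReader(piece, limit.StreamLength), nil
}
```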
Satellite
Pack upload selection
For uploading packed segments we should only upload to storage nodes that support the `stream_offset` feature.
Metabase changes
Update metabase tables:
- add `stream_offset`, `stream_length` to the `objects` table, to track the location in the stream / segment;
- add `stream_encrypted_key_nonce`, `stream_encrypted_key` etc. to the `objects` table, to track the necessary information for decryption;
- add `reference_count` to the `segments` table, to track how many objects are still referencing a particular segment;
- add `garbage_bytes` to the `segments` table, to track how fragmented a given packed segment is.
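Purely as illustration (not actual metabase code), the new columns could be modeled as the Go fields below; the names mirror the list above.

```go
package packer

// packedObjectFields sketches the columns added to the objects table.
type packedObjectFields struct {
	StreamOffset            int64  // byte offset of the object inside the packed segment
	StreamLength            int64  // byte length of the object inside the packed segment
	StreamEncryptedKeyNonce []byte // information needed to decrypt the packed stream
	StreamEncryptedKey      []byte
}

// packedSegmentFields sketches the columns added to the segments table.
type packedSegmentFields struct {
	ReferenceCount int64 // how many objects still reference this segment
	GarbageBytes   int64 // bytes that belong to already-deleted objects
}
```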
Object deletes need to update `reference_count` and `garbage_bytes`.
The `stream_encrypted_key` etc. could alternatively be stored in the `segments` table as a separate field, or even interleaved in the packed segments themselves. The appropriate location should be determined by testing.
New API will be needed to:
- find a batch of objects to pack,
- replace the objects with a pack.
Replacing the objects should assume that replacing a pack may partially fail due to concurrent updates or deletes; the packing should still succeed when most of the replacements succeed (see the sketch below).
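A sketch of such tolerant replacement, where `replaceStream` stands in for the hypothetical new metabase call and the "most replacements succeed" rule is an arbitrary majority check, not a decided policy:

```go
package packer

import (
	"context"
	"fmt"
)

// replacement describes pointing one object at its location inside the packed
// segment; the field set is illustrative only.
type replacement struct {
	ObjectKey    string
	StreamOffset int64
	StreamLength int64
}

// replaceStream stands in for the new metabase API that replaces a single
// object's stream reference.
type replaceStream func(ctx context.Context, r replacement) error

// replacePackedObjects tolerates individual failures (e.g. the object was
// deleted or overwritten concurrently) and only fails the pack when too many
// replacements could not be applied.
func replacePackedObjects(ctx context.Context, replace replaceStream, rs []replacement) error {
	failed := 0
	for _, r := range rs {
		if err := replace(ctx, r); err != nil {
			failed++ // expected under concurrent updates or deletes; skip this object
		}
	}
	if failed > len(rs)/2 {
		return fmt.Errorf("pack replacement failed for %d of %d objects", failed, len(rs))
	}
	return nil
}
```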
We need to take care that we don't break downloads while replacing the streams in objects. Currently, `stream_id` is used for several different identity purposes.
It's quite likely we'll eventually hit a situation where:
- uplink starts downloading object X, with piece id Y
- satellite repacks X to a new segment
- satellite sends deletes to storagenodes with piece id Y
- uplink fails to download object X
The satellite needs to take care that the piece id Y is stored at least until the downloading "token" expires.
Packing Process
The packing process will need to:
- query the objects table one bucket at a time, to discover small committed objects that haven't been packed together;
- download encrypted object data into memory;
- rebuild pieces as necessary;
- concatenate the pieces together, keeping track of where each segment is located using `stream_offset` and `stream_length`;
- upload the packed segment to a set of storage nodes;
- replace the stream references in the objects table with the packed segments.
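The loop below is a rough Go-flavored sketch of these steps over a single bucket; every type and helper in it (`packCandidate`, `downloadEncrypted`, `uploadPackedSegment`, `replacePack`) is hypothetical and merely stands in for the real metabase, download, and upload machinery.

```go
package packer

import "context"

// packCandidate is a small, committed, not-yet-packed object found in the
// objects table; the fields are illustrative.
type packCandidate struct {
	ObjectKey string
	Size      int64
}

// packedEntry remembers where a candidate ended up inside the packed segment.
type packedEntry struct {
	ObjectKey    string
	StreamOffset int64
	StreamLength int64
}

// packBucket sketches one pass of the packing process over a single bucket.
func packBucket(
	ctx context.Context,
	candidates []packCandidate,
	downloadEncrypted func(context.Context, packCandidate) ([]byte, error),
	uploadPackedSegment func(context.Context, []byte) error,
	replacePack func(context.Context, []packedEntry) error,
) error {
	var packed []byte
	var entries []packedEntry

	for _, c := range candidates {
		data, err := downloadEncrypted(ctx, c) // encrypted object data, kept in memory
		if err != nil {
			continue // skip objects that disappeared or failed to download
		}
		entries = append(entries, packedEntry{
			ObjectKey:    c.ObjectKey,
			StreamOffset: int64(len(packed)),
			StreamLength: int64(len(data)),
		})
		packed = append(packed, data...)
	}
	if len(entries) == 0 {
		return nil
	}
	// Upload the concatenated segment to a set of storage nodes, then point
	// the objects at it; the old segments are kept until order limits expire.
	if err := uploadPackedSegment(ctx, packed); err != nil {
		return err
	}
	return replacePack(ctx, entries)
}
```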
For choosing `expires_at` in the segments table we can use the maximum of the given segments. The object will still track its appropriate `expires_at` date, and during zombie deletion the `garbage_bytes` can be updated on the segment as necessary.
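A tiny sketch of that expiry choice, assuming (as an assumption of this sketch, not stated above) that a nil expiration means "never expires":

```go
package packer

import "time"

// packExpiresAt picks the expires_at for a packed segment as the maximum of
// the packed objects' expirations. A nil value is assumed to mean "never
// expires" and therefore keeps the whole segment alive.
func packExpiresAt(expirations []*time.Time) *time.Time {
	var latest *time.Time
	for _, e := range expirations {
		if e == nil {
			return nil // one non-expiring object keeps the whole segment
		}
		if latest == nil || e.After(*latest) {
			latest = e
		}
	}
	return latest
}
```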
When a segment has an `encrypted_etag` we can handle it in a few possible ways. First, we can ignore them, since the packing only applies to small objects, which usually wouldn't be uploaded via multipart upload. Secondly, since the etags are only relevant for multipart upload, we could drop them once we commit (assuming S3 doesn't need them later).
The object packing should take into account:
- total size of the pack - choosing too few objects means the packing is less effective,
- expires_at time - choosing similar values means the segment needs to hold less garbage,
- encrypted object key - objects with the same prefix are more likely to be deleted together,
- upload time - things uploaded at similar times are more likely to be deleted at similar times.
None of these are critical for the first implementation; however, we should try to figure out good heuristics for packing to reduce fragmentation.
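As one possible heuristic (purely illustrative, not a decided policy), candidates could be bucketed by key prefix and expiration day, and ordered by upload time within each bucket:

```go
package packer

import (
	"path"
	"sort"
	"time"
)

// packGroupKey groups candidates that are likely to be deleted together:
// same key prefix and a similar expiration window. The one-day bucketing is
// an arbitrary example, not a tuned policy.
type packGroupKey struct {
	Prefix    string
	ExpiryDay int64 // expires_at truncated to days; 0 for "never expires"
}

// candidate is a small object that could be packed; the fields are illustrative.
type candidate struct {
	ObjectKey string
	ExpiresAt *time.Time
	CreatedAt time.Time
}

// groupCandidates buckets candidates and orders each bucket by upload time,
// so objects uploaded together tend to land in the same packed segment.
func groupCandidates(cs []candidate) map[packGroupKey][]candidate {
	groups := make(map[packGroupKey][]candidate)
	for _, c := range cs {
		key := packGroupKey{Prefix: path.Dir(c.ObjectKey)}
		if c.ExpiresAt != nil {
			key.ExpiryDay = c.ExpiresAt.Unix() / (24 * 60 * 60)
		}
		groups[key] = append(groups[key], c)
	}
	for _, group := range groups {
		sort.Slice(group, func(i, j int) bool {
			return group[i].CreatedAt.Before(group[j].CreatedAt)
		})
	}
	return groups
}
```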
It would be possible to avoid satellite-side bandwidth overhead by letting storage nodes send the pieces to each other and constructing the packed segment / piece that way. This, however, requires a significantly more complicated protocol, as we've seen with graceful exit.
Note that object packing creates a strong long-tail cancellation and locality preference towards the repacking or repair node. Hence, the repacking and repair nodes need to be more lenient with regard to long-tail cancellation.
Re-packing Process
The re-packing process will need to mark objects as needing repacking. By inspecting the `garbage_bytes` and `encrypted_size` in segments, it's possible to decide whether a segment should be repacked.
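A trivial sketch of that decision; the 50% threshold is an arbitrary example, not a recommendation:

```go
package packer

// shouldRepack sketches the repacking decision: once a large enough share of
// the packed segment consists of bytes from deleted objects, repacking pays
// off.
func shouldRepack(garbageBytes, encryptedSize int64) bool {
	if encryptedSize <= 0 {
		return false
	}
	return float64(garbageBytes)/float64(encryptedSize) >= 0.5
}
```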
TODO: figure out how to mark objects as needing repacking.
Service updates
TODO: review how services need to be updated.
Alternative Solutions
Storagenode Cohorts
One alternative approach would be to keep an "appendable segment" and, during upload, choose a segment to append to. This avoids the need for a background process.
To implement this, it would need some sort of in-memory index to choose an appropriate appendable segment, based on the aspects mentioned in "Packing Process". This selection is not complicated to implement.
However, the failure scenarios are much more difficult to handle correctly.
First, the storage node now needs to have a notion of a "segment being modified". This modification can happen during ongoing downloads. Similarly, flushing data to disk is non-trivial. The current model of "write, flush, commit" without modification helps to avoid many such problems.
The satellite then needs to take into account storage nodes failing. The failure could be due to a network issue, a node restart, or long-tail cancellation. When the uplink receives such a failure during upload, the whole "packed segment" so far would need to be discarded on that storage node. When the uploads are spaced further apart in time, it's more likely that during each "append" one node fails. Tracking additional information for partial uploads would defeat the purpose of packing. In other words, the failure to upload to a cohort multiplies over time. This means you can probably upload to a cohort only some maximum number of times before it would need to be repaired.
The appendable segments also will need to be audited and repaired, complicating the situation further.
Wrapup
[Who will archive the blueprint when completed? What documentation needs to be updated to preserve the relevant information from the blueprint?]
Open issues
- Marking objects for repacking
- Concurrency issue with upload and packing
- Services that need updating
- Calculations and simulation of the ingress/egress cost