Satellite Side Async Packing
Abstract
This blueprint describes a way to pack small objects without backwards-incompatible changes on the uplink side; it should be relatively easy to implement.
The rough idea is to have a background process that packs small object segments together into a single segment. Multiple objects will then refer to that single segment.
Background
Small files have several disadvantages in the current system.
- Every small file requires one object and one segment.
- Every small piece on a storage node uses more disk space than needed.
- The metabase segment loop has a lot of entries to iterate over.
- There is significant connection overhead for small files; this design does not improve that.
Design
In the metabase database, multiple objects will refer to a single stream (segment), with sub-ranges. This implies that the `segments` table will now need to contain a reference count, among other things.
A packing background process will:
- query the objects table one bucket at a time, to discover small objects that haven't been packed together;
- download encrypted object data into memory;
- rebuild pieces as necessary;
- concatenate the pieces together, keeping track of where each segment is located using `stream_offset` and `stream_length`;
- upload the packed segment to a set of storage nodes;
- replace the stream references in the objects table with the packed segments.
The satellite, on a download request, will add the `stream_offset` and `stream_length` to the `OrderLimit`. The uplink doesn't need to be aware of this change. The storage node, when getting such an order limit, will appropriately read only a sub-range from the stored piece.
Note: we should not delete the old segments at step 5, otherwise we might delete a segment that is being actively downloaded. We need to preserve the old unpacked segments for at least the order limit validity duration (currently 48h). This could be achieved by updating segment TTLs on the storage nodes.
Repair, audit, project accounting may need adjustments. Storage node accounting and GC should work as is.
We also need a process to handle pack fragmentation. For example when 99% of packed segment is "deleted", then we should repack. One option to implement this is to add a process that iterates packed segments and looks at the "garbage_bytes" -- however there's no easy way to mark the "objects" as needing repacking.
The reference counting, of course, would add overhead to all deletes.
Design - Uplink
It would be possible to implement additional API to uplink that uploads packed segments from the start. A very rough implementation could look like:
pack := uplink.BeginPack()
for _, file := range files {
	err := pack.Upload(file.Name, ..., file.Data)
	if err != nil {
		return err
	}
}
pack.Commit()
This is not the final design; it will be revisited when implementing the API.
The main problem with this approach is how to handle crossing segment boundaries, since we don't want a 2-byte object to be stored on two different segments.
The easiest way to avoid the boundary issue is to force the user to specify the size upfront, e.g. by taking a `[]byte` as an argument or by taking a size as an argument when starting the upload.
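As a hedged illustration of the size-upfront option, the sketch below takes the whole object as a `[]byte`, so the packer knows before writing anything whether the object still fits into the current packed segment. All names (`Pack`, `UploadBytes`, `maxPackedSegmentSize`) and the size limit are placeholders, not a proposed API.

```go
package uplinkpack

import "errors"

// maxPackedSegmentSize is a placeholder limit for a single packed segment.
const maxPackedSegmentSize = 64 << 20 // 64 MiB, illustrative only

// Pack is a hypothetical handle for uploading multiple small objects into a
// single packed segment.
type Pack struct {
	used int64
}

// UploadBytes takes the whole object upfront, so the pack knows whether the
// object still fits into the current segment before writing anything.
func (p *Pack) UploadBytes(key string, data []byte) error {
	if int64(len(data)) > maxPackedSegmentSize {
		return errors.New("object too large for packing; upload it as a regular object")
	}
	if p.used+int64(len(data)) > maxPackedSegmentSize {
		// A real implementation would commit the current segment and start
		// a new one; this sketch only signals the boundary.
		return errors.New("current packed segment is full")
	}
	p.used += int64(len(data))
	// ... encrypt and append the data under the given key to the in-progress segment ...
	return nil
}
```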
The alternative is to over-allocate bandwidth and set a limit for the split position, e.g. when the packed segment exceeds 32 MiB, break whenever the current object is finished. When the object is actually large, split the segment as usual.
Rationale
It's possible to implement the same packing across all buckets and projects; however, this would have significant issues with deleting whole buckets and projects. By packing within a single bucket, we can still directly delete the segments when deleting a bucket or project.
This satellite-side packing does have an ingress and egress cost; however, the long-term satellite-side storage savings should outweigh it.
Implementation
Storage Node
Add `stream_offset` and `stream_length` to `pb.OrderLimit`. The storage node should respect these values, and the uplink should not need to treat such order limits separately.
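A minimal sketch of the storage node side, assuming hypothetical `StreamOffset`/`StreamLength` values taken from the order limit (the struct below merely stands in for the extended `pb.OrderLimit`) and a piece reader that supports seeking:

```go
package piecereader

import "io"

// packedOrderLimit stands in for pb.OrderLimit extended with the proposed
// fields; the real message carries many more fields.
type packedOrderLimit struct {
	StreamOffset int64 // where the requested object starts inside the piece
	StreamLength int64 // how many bytes of the piece belong to the object
}

// openSubrange restricts reads to the object's sub-range of the stored piece.
// The piece reader must support seeking (e.g. an *os.File).
func openSubrange(piece io.ReadSeeker, limit packedOrderLimit) (io.Reader, error) {
	if limit.StreamLength == 0 {
		// No sub-range requested: serve the whole piece as before.
		return piece, nil
	}
	if _, err := piece.Seek(limit.StreamOffset, io.SeekStart); err != nil {
		return nil, err
	}
	return io.LimitReader(piece, limit.StreamLength), nil
}
```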
Satellite
Pack upload selection
For uploading packed segments we should only upload to storage nodes that support the `stream_offset` feature.
Metabase changes
Update metabase tables:
- add `stream_offset`, `stream_length` to the `objects` table, to track the location in the stream / segment;
- add `stream_encrypted_key_nonce`, `stream_encrypted_key` etc. to the `objects` table, to track the necessary information for decryption;
- add `reference_count` to the `segments` table, to track how many objects are still referencing a particular segment;
- add `garbage_bytes` to the `segments` table, to track how fragmented a given packed segment is.
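Purely as illustration (not actual metabase code), the new columns could be modeled as the Go fields below; the names mirror the list above.

```go
package packer

// packedObjectFields sketches the columns added to the objects table.
type packedObjectFields struct {
	StreamOffset            int64  // byte offset of the object inside the packed segment
	StreamLength            int64  // byte length of the object inside the packed segment
	StreamEncryptedKeyNonce []byte // information needed to decrypt the packed stream
	StreamEncryptedKey      []byte
}

// packedSegmentFields sketches the columns added to the segments table.
type packedSegmentFields struct {
	ReferenceCount int64 // how many objects still reference this segment
	GarbageBytes   int64 // bytes that belong to already-deleted objects
}
```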
Object deletes need to update `reference_count` and `garbage_bytes`.
The `stream_encrypted_key` etc. could alternatively be stored in the `segments` table as a separate field, or even interleaved in the packed segments themselves. The appropriate location should be determined by testing.
New API will be needed to:
- find a batch of objects to pack,
- replace the objects with a pack.
Replacing the objects should assume that replacing a pack may partially fail due to concurrent updates or deletes; the packing should still succeed when most of the replacements succeed (see the sketch below).
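A sketch of such tolerant replacement, where `replaceStream` stands in for the hypothetical new metabase call and the "most replacements succeed" rule is an arbitrary majority check, not a decided policy:

```go
package packer

import (
	"context"
	"fmt"
)

// replacement describes pointing one object at its location inside the packed
// segment; the field set is illustrative only.
type replacement struct {
	ObjectKey    string
	StreamOffset int64
	StreamLength int64
}

// replaceStream stands in for the new metabase API that replaces a single
// object's stream reference.
type replaceStream func(ctx context.Context, r replacement) error

// replacePackedObjects tolerates individual failures (e.g. the object was
// deleted or overwritten concurrently) and only fails the pack when too many
// replacements could not be applied.
func replacePackedObjects(ctx context.Context, replace replaceStream, rs []replacement) error {
	failed := 0
	for _, r := range rs {
		if err := replace(ctx, r); err != nil {
			failed++ // expected under concurrent updates or deletes; skip this object
		}
	}
	if failed > len(rs)/2 {
		return fmt.Errorf("pack replacement failed for %d of %d objects", failed, len(rs))
	}
	return nil
}
```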
We need to take care that we don't break downloads while replacing the streams in objects. Currently, `stream_id` is used for several different identity purposes.
It's quite likely we'll eventually hit a situation where:
- uplink starts downloading object X, with piece id Y
- satellite repacks X to a new segment
- satellite sends deletes to storagenodes with piece id Y
- uplink fails to download object X
The satellite needs to take care that the piece id Y is stored at least until the downloading "token" expires.
Packing Process
The packing process will need to:
- query the objects table one bucket at a time, to discover small committed objects that haven't been packed together;
- download encrypted object data into memory;
- rebuild pieces as necessary;
- concatenate the pieces together, keeping track of where each segment is located using `stream_offset` and `stream_length`;
- upload the packed segment to a set of storage nodes;
- replace the stream references in the objects table with the packed segments.
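The loop below is a rough Go-flavored sketch of these steps over a single bucket; every type and helper in it (`packCandidate`, `downloadEncrypted`, `uploadPackedSegment`, `replacePack`) is hypothetical and merely stands in for the real metabase, download, and upload machinery.

```go
package packer

import "context"

// packCandidate is a small, committed, not-yet-packed object found in the
// objects table; the fields are illustrative.
type packCandidate struct {
	ObjectKey string
	Size      int64
}

// packedEntry remembers where a candidate ended up inside the packed segment.
type packedEntry struct {
	ObjectKey    string
	StreamOffset int64
	StreamLength int64
}

// packBucket sketches one pass of the packing process over a single bucket.
func packBucket(
	ctx context.Context,
	candidates []packCandidate,
	downloadEncrypted func(context.Context, packCandidate) ([]byte, error),
	uploadPackedSegment func(context.Context, []byte) error,
	replacePack func(context.Context, []packedEntry) error,
) error {
	var packed []byte
	var entries []packedEntry

	for _, c := range candidates {
		data, err := downloadEncrypted(ctx, c) // encrypted object data, kept in memory
		if err != nil {
			continue // skip objects that disappeared or failed to download
		}
		entries = append(entries, packedEntry{
			ObjectKey:    c.ObjectKey,
			StreamOffset: int64(len(packed)),
			StreamLength: int64(len(data)),
		})
		packed = append(packed, data...)
	}
	if len(entries) == 0 {
		return nil
	}
	// Upload the concatenated segment to a set of storage nodes, then point
	// the objects at it; the old segments are kept until order limits expire.
	if err := uploadPackedSegment(ctx, packed); err != nil {
		return err
	}
	return replacePack(ctx, entries)
}
```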
For choosing `expires_at` in the segments table we can use the maximum of the given segments. The object will still track its appropriate `expires_at` date, and during zombie deletion the `garbage_bytes` can be updated on the segment as necessary.
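A tiny sketch of that expiry choice, assuming (as an assumption of this sketch, not stated above) that a nil expiration means "never expires":

```go
package packer

import "time"

// packExpiresAt picks the expires_at for a packed segment as the maximum of
// the packed objects' expirations. A nil value is assumed to mean "never
// expires" and therefore keeps the whole segment alive.
func packExpiresAt(expirations []*time.Time) *time.Time {
	var latest *time.Time
	for _, e := range expirations {
		if e == nil {
			return nil // one non-expiring object keeps the whole segment
		}
		if latest == nil || e.After(*latest) {
			latest = e
		}
	}
	return latest
}
```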
When a segment has an `encrypted_etag` we can handle it in a few possible ways. First, we can ignore them, since the packing only applies to small objects, which usually wouldn't be uploaded via multipart upload. Secondly, since the etags are only relevant for multipart upload, we could drop them once we commit (assuming S3 doesn't need them later).
The object packing should take into account:
- total size of the pack - choosing too few objects means the packing is less effective,
- expires_at time - choosing similar values means the segment needs to hold less garbage,
- encrypted object key - objects with the same prefix are more likely to be deleted together,
- upload time - things uploaded at similar times are more likely to be deleted at similar times.
None of these are critical for the first implementation; however, we should try to figure out good heuristics for packing to reduce fragmentation.
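As one possible heuristic (purely illustrative, not a decided policy), candidates could be bucketed by key prefix and expiration day, and ordered by upload time within each bucket:

```go
package packer

import (
	"path"
	"sort"
	"time"
)

// packGroupKey groups candidates that are likely to be deleted together:
// same key prefix and a similar expiration window. The one-day bucketing is
// an arbitrary example, not a tuned policy.
type packGroupKey struct {
	Prefix    string
	ExpiryDay int64 // expires_at truncated to days; 0 for "never expires"
}

// candidate is a small object that could be packed; the fields are illustrative.
type candidate struct {
	ObjectKey string
	ExpiresAt *time.Time
	CreatedAt time.Time
}

// groupCandidates buckets candidates and orders each bucket by upload time,
// so objects uploaded together tend to land in the same packed segment.
func groupCandidates(cs []candidate) map[packGroupKey][]candidate {
	groups := make(map[packGroupKey][]candidate)
	for _, c := range cs {
		key := packGroupKey{Prefix: path.Dir(c.ObjectKey)}
		if c.ExpiresAt != nil {
			key.ExpiryDay = c.ExpiresAt.Unix() / (24 * 60 * 60)
		}
		groups[key] = append(groups[key], c)
	}
	for _, group := range groups {
		sort.Slice(group, func(i, j int) bool {
			return group[i].CreatedAt.Before(group[j].CreatedAt)
		})
	}
	return groups
}
```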
It would be possible to avoid satellite-side bandwidth overhead by letting storage nodes send the pieces to each other and constructing the packed segment / piece that way. This, however, requires a significantly more complicated protocol, as we've seen with graceful exit.
Note that object packing creates a strong long-tail cancellation and locality preference towards the repacking or repair node. Hence, the repacking and repair nodes need to be more lenient with regard to long-tail cancellation.
Re-packing Process
The re-packing process will need to mark objects as needing repacking. By inspecting the `garbage_bytes` and `encrypted_size` in segments, it's possible to decide whether a segment should be repacked.
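A trivial sketch of that decision; the 50% threshold is an arbitrary example, not a recommendation:

```go
package packer

// shouldRepack sketches the repacking decision: once a large enough share of
// the packed segment consists of bytes from deleted objects, repacking pays
// off.
func shouldRepack(garbageBytes, encryptedSize int64) bool {
	if encryptedSize <= 0 {
		return false
	}
	return float64(garbageBytes)/float64(encryptedSize) >= 0.5
}
```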
TODO: figure out how to mark objects as needing repacking.
Service updates
TODO: review how services need to be updated.
Alternative Solutions
Storagenode Cohorts
One alternative approach would be to keep an "appendable segment" and, during upload, choose a segment to append to. This avoids the need for a background process.
To implement this, it would need some sort of in-memory index to choose an appropriate appendable segment, based on the aspects mentioned in "Packing Process". This selection is not complicated to implement.
However, the failure scenarios are much more difficult to handle correctly.
First, the storage node now needs to have a notion of a "segment being modified". This modification can happen during ongoing downloads. Similarly, flushing data to disk is non-trivial. The current model of "write, flush, commit" without modification helps to avoid many such problems.
The satellite then needs to take into account storage nodes failing. The failure could be due to a network issue, a node restart, or long-tail cancellation. When the uplink receives such a failure during upload, the whole "packed segment" so far would need to be discarded on that storage node. When the uploads are spaced further apart in time, it's more likely that during each "append" one node fails. Tracking additional information for partial uploads would defeat the purpose of packing. In other words, the failure to upload to a cohort multiplies over time. This means you can probably upload to a cohort only some maximum number of times before it would need to be repaired.
The appendable segments also will need to be audited and repaired, complicating the situation further.
Wrapup
[Who will archive the blueprint when completed? What documentation needs to be updated to preserve the relevant information from the blueprint?]
Open issues
- Marking objects for repacking
- Concurrency issue with upload and packing
- Services that need updating
- Calculations and simulation of the ingress/egress cost