Mirror of https://github.com/flatpak/flatpak.git (synced 2026-01-29 01:51:20 -05:00)
This is an optimized version of ostree_repo_prune() specialized for archive mode repos. It is faster and uses less memory, so that we can prune larger repos (like flathub) in a realistic timeframe.

The primary reason it is faster is that it creates and uses a `.commitmeta2` file for each commit, containing information about which objects are reachable from that commit. This means incremental prunes only need to traverse newly created commits.

Secondly, it uses the variant parser compiled accessors for the various GVariants that are involved in the prune, which is quite a bit faster, especially if the repo is very large.

It also merges the scan-for-all-objects and prune-unreachable-objects phases, which means that we don't have to allocate a hashtable for all the objects in the entire repo, saving a lot of memory. To save further memory, the hashtable of reachable objects, which can be quite big on a big repo, points to a custom, very compact format for object names.

Additionally, it scans for reachable objects twice: first with a shared lock, and then again (if anything changed) with an exclusive lock. This allows us to avoid holding an exclusive lock during the slowest part of the prune.

Unfortunately there are currently no public APIs for the ostree repo locks. We really need to take an exclusive lock during the whole prune, or parallel modifications (say a commit) might get their newly written objects deleted. To work around this we have a minimal custom implementation of an exclusive lock. Once the public API is available we can start using that.

I created a repo with a lot of small commits to test this. It has 9M objects, and pruning with depth=10 deletes 2M of them.
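As a rough illustration of the two main ideas above (per-commit cached reachability and the merged scan-and-prune pass), here is a minimal Python sketch. It is not the flatpak implementation, which is written in C against libostree. The names `reachable_for_commit`, `prune`, and `traverse` are hypothetical, an in-memory dict stands in for the on-disk `.commitmeta2` files, and object names are plain strings rather than the compact format mentioned above.

```python
from typing import Callable, Dict, Iterable, List, Set


def reachable_for_commit(
    commit: str,
    cache: Dict[str, Set[str]],
    traverse: Callable[[str], Set[str]],
) -> Set[str]:
    """Return the objects reachable from `commit`.

    The dict `cache` is a stand-in (assumption) for the per-commit
    `.commitmeta2` file: the slow traversal runs only on a cache miss.
    """
    if commit not in cache:
        cache[commit] = traverse(commit)
    return cache[commit]


def prune(
    kept_commits: Iterable[str],
    all_objects: Iterable[str],
    cache: Dict[str, Set[str]],
    traverse: Callable[[str], Set[str]],
) -> List[str]:
    """Return the objects that would be deleted, in scan order."""
    reachable: Set[str] = set()
    for commit in kept_commits:
        reachable |= reachable_for_commit(commit, cache, traverse)

    # Merged scan and prune: objects are streamed one at a time, so only
    # the reachable set is held in memory, never a table of every object.
    deleted: List[str] = []
    for obj in all_objects:
        if obj not in reachable:
            deleted.append(obj)  # a real prune would unlink the file here
    return deleted


# Toy repo: each commit maps to the objects it reaches (including itself).
graph = {"c1": {"c1", "a", "b"}, "c2": {"c2", "b", "c"}}
calls: List[str] = []


def traverse(commit: str) -> Set[str]:
    calls.append(commit)  # record every slow traversal
    return set(graph[commit])


cache: Dict[str, Set[str]] = {}

# First prune: both commits are kept, "x" is an orphaned object.
first_deleted = prune(["c1", "c2"], iter(["c1", "c2", "a", "b", "c", "x"]), cache, traverse)

# A new commit arrives; the second prune keeps only the two newest commits.
graph["c3"] = {"c3", "c", "d"}
second_deleted = prune(["c2", "c3"], iter(["c1", "c2", "c3", "a", "b", "c", "d"]), cache, traverse)
```

In this toy run the second prune traverses only the new commit `c3`, because `c2` is served from the cache. That is the effect the per-commit reachability data is meant to have on incremental prunes.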
The original performance looks like:

  Finding reachable objects: 287 seconds
  Pruning unreachable: 69 seconds

Just using the pregenerated reachable data:

  Finding reachable objects: 15 seconds
  Pruning unreachable: 69 seconds

The final optimized prune (using pregenerated data):

  Finding reachable objects: 12 seconds
  Pruning unreachable: 51 seconds

The above are with the page caches cleaned; on a second run the performance increase is even more noticeable.

As a comparison to the above, finding the reachable objects in the actual flathub repo took 22 hours, but with the pregenerated reachable data only 39 minutes.
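To put the figures above in relative terms, the following short Python snippet works out the speedup factors. It uses only the numbers quoted in this message; the variable names are illustrative, and since the inputs are rounded measurements the ratios should be read as approximate.

```python
# Timings in seconds, copied from the measurements quoted above.
original = {"find": 287, "prune": 69}
pregenerated = {"find": 15, "prune": 69}
optimized = {"find": 12, "prune": 51}


def speedup(before: float, after: float) -> float:
    """How many times faster `after` is compared to `before`."""
    return before / after


# Reachability phase only.
find_pregenerated = speedup(original["find"], pregenerated["find"])
find_optimized = speedup(original["find"], optimized["find"])

# Whole prune (both phases added together).
total_optimized = speedup(sum(original.values()), sum(optimized.values()))

# flathub reachability phase: 22 hours versus 39 minutes, in minutes.
flathub_find = speedup(22 * 60, 39)

print(f"find, pregenerated data: {find_pregenerated:.1f}x")  # 19.1x
print(f"find, final optimized:   {find_optimized:.1f}x")     # 23.9x
print(f"total, final optimized:  {total_optimized:.1f}x")    # 5.7x
print(f"flathub find phase:      {flathub_find:.1f}x")       # 33.8x
```

On the test repo the overall gain (about 5.7x) is much smaller than the gain in the reachability phase (about 24x), because after the optimization the unreachable-object pruning phase accounts for most of the remaining time.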