Well, regarding the "based on something", you can hand off a list of packages to createrepo_c with --pkglist, and avoid the need to download files with --update + --skip-stat. Unfortunately that doesn't help you with the package file management. In a vacuum --baseurl would help here because you could have one root directory, but in reality it breaks repository mirroring, because any mirror would be telling clients to fetch the packages from the source-of-truth.
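To make that concrete, here is a rough sketch of how the --pkglist invocation could be put together from a script. The helper name, the repo path and the package path are all made up for illustration, and I'm assuming the list file is one package path per line, relative to the repo directory. It only builds the command line; it doesn't run createrepo_c.

```python
from pathlib import Path


def build_createrepo_command(repo_dir, packages, pkglist_path):
    """Write a package list file and return the createrepo_c argv (not executed).

    Hypothetical helper for illustration only; it is not part of createrepo_c.
    """
    pkglist = Path(pkglist_path)
    # Assumption: one package path per line, no globs, relative to repo_dir.
    pkglist.write_text("".join(f"{pkg}\n" for pkg in packages))
    return [
        "createrepo_c",
        "--update",      # reuse the existing metadata where possible
        "--skip-stat",   # with --update, don't stat the package files
        "--pkglist", str(pkglist),
        str(repo_dir),
    ]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        cmd = build_createrepo_command(
            "/srv/repo",                          # hypothetical repo root
            ["Packages/f/foo-1.0-1.noarch.rpm"],  # hypothetical package
            Path(tmp) / "pkglist.txt",
        )
        print(" ".join(cmd))
```

You would pass the returned list to something like subprocess.run() on a host that actually has the repo and its old metadata. As said, it does nothing about where the package files themselves live.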
I'm not 100% sure how --basedir works, the description is a bit vague.
Another option is to use something like Pulp, which stores all the information required for metadata generation inside PostgreSQL and thus can generate it without ever touching the packages / headers again. That approach isn't necessarily free of downsides either, but it does abstract away the whole file management problem.
I think we need to begin by figuring out what happens in at least the 'pungi' part of the daily 'let's make updates and rawhide'. There are a LOT of things going on which interrelate with each other and are prone to cascading breakage when something is 'added', 'removed', 'fixed', or 'changed'. There are also some hard resource limitations on how much CPU/disk/memory is available, how close things must be to 'work', and places where things break a lot but where trying to remove/fix/change them would require longer downtimes than the project has allowed in the past.
I say this from having done all of the above at one point or another and caused all kinds of chaos in doing so. I have probably used up more of Kevin's patience on those than was right.