Isn't that what chipsets and PLX switches do? How did they put something like that in the SSD controller while staying performance- and cost-competitive?
Disclaimer: I'm a software, not a hardware, developer, but some concerns (algorithms and time complexity, for instance) apply to both domains, so here's my best guess.
It almost certainly boils down to the number of PCIe lanes involved. Aggregating and deaggregating the data streams to/from a set of lanes has processing overhead; the higher that overhead, the more PCIe bandwidth you effectively lose (because the data transfer stalls while the agg/deagg processing is ongoing), and the more lanes involved, the more processing is required - and thus the higher the total overhead. That's why Samsung can get away with a relatively simple 4 => 2 or 2 => 4 mapping, whereas enterprise gear deals with far higher counts like 32 => 16 lanes and vice versa. The latter is the level at which you need dedicated hardware for this processing, so that the overhead - and therefore the bandwidth loss - stays as small as possible.
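To put rough numbers on that scaling, here's a back-of-the-envelope sketch in Python. The per-lane rates are the usual published PCIe figures; the per-lane overhead fraction is purely illustrative - a number I made up for the example, not a measured property of any real bridge or switch.

```python
# Back-of-the-envelope sketch of the lane-mapping argument above.
# Per-lane rates are standard published PCIe figures (GB/s, net of
# encoding); the per-lane "mux overhead" fraction is an invented
# illustrative number, not measured from any real device.

PER_LANE_GBPS = {3: 0.985, 4: 1.969, 5: 3.938}

def effective_bandwidth(gen: int, lanes: int, per_lane_overhead: float) -> float:
    """Raw link bandwidth minus an assumed aggregation/deaggregation cost
    that grows with the number of lanes being (de)multiplexed."""
    raw = PER_LANE_GBPS[gen] * lanes
    return raw * (1.0 - per_lane_overhead * lanes)

# Samsung-style SSD bridge: 2 lanes of Gen5 on one side, 4 of Gen4 on the other.
print(f"Gen5 x2 : {effective_bandwidth(5, 2, 0.005):.1f} GB/s")   # ~7.8 GB/s
print(f"Gen4 x4 : {effective_bandwidth(4, 4, 0.005):.1f} GB/s")   # ~7.7 GB/s

# Enterprise-style switch: 16 <-> 32 lanes. The same per-lane cost now
# eats a much bigger slice of the raw bandwidth, hence dedicated silicon.
print(f"Gen5 x16: {effective_bandwidth(5, 16, 0.005):.1f} GB/s")  # ~58 GB/s vs ~63 GB/s raw
print(f"Gen5 x32: {effective_bandwidth(5, 32, 0.005):.1f} GB/s")  # ~106 GB/s vs ~126 GB/s raw
```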
It's also why PCIe lane switches became too expensive to use in desktop applications - as per-lane bandwidth doubles with each new PCIe version, the hardware required to switch those lanes with acceptably low bandwidth loss has necessarily become more and more complex. On top of that, the electrical requirements for supporting that higher bandwidth keep getting more stringent, which means more components, which also drives up cost.
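For a sense of scale, the per-lane doubling looks roughly like this (published approximate figures; the 16-lane switch is just an example size I picked for illustration):

```python
# Approximate per-lane throughput by PCIe generation (GB/s, net of
# 8b/10b or 128b/130b encoding). Each generation roughly doubles the
# data a switch has to move per lane.
PER_LANE_GBPS = {1: 0.25, 2: 0.50, 3: 0.985, 4: 1.969, 5: 3.938}

SWITCH_LANES = 16  # illustrative consumer-sized switch, not a specific product

for gen, per_lane in PER_LANE_GBPS.items():
    print(f"PCIe {gen}.0: {per_lane:6.3f} GB/s/lane -> "
          f"a {SWITCH_LANES}-lane switch must sustain ~{per_lane * SWITCH_LANES:5.1f} GB/s")
```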
Ever-faster data transfer links are a double-edged sword - great for consumers, but rough on engineers. This is also why new versions of PCIe have appeared less frequently than earlier ones - it's taking the engineers building PCIe devices longer and longer to (re-)design their components to adequately handle the ever-increasing bandwidth and lane counts.