Rearrange the oversampled taps in memory to make it easier to use
SIMD instructions on them. this simplifies some sse code.
Add some more optimizations
Improve int16 resampling by using pmaddwd
Use intrinsics to scale and pack int16 samples
Align the coefficients so that we can use aligned loads
Add padding to taps and samples so that we don't have to use partial
loads for the remainder of the loops.
Remove copy_n, we can reuse the plain copy function with some new
parameters.
Align and pad the sample array.