Seriously faster. Algorithm is nearly the same as bilinear, which given the speed of this code, should be considered the baseline of quality. Speed appears to be limited by memory bandwidth, so I didn't bother trying to make it any faster.