Convert many FP divides to FP multiplies
Hotspot analysis turned up opportunities to speed up computation by precomputing the reciprocals of constants and turning divides into multiplies. On most CPUs an FP divide takes longer than an FP multiply. The gain depends on the microarchitecture of your target machine. I see about +4% on my Ampere Altra aarch64 machine, depending on the inputs.
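For reviewers, here is a minimal sketch of the kind of transformation meant here. It is not taken from the patch: the function names scale_div and scale_mul and the parameter names are hypothetical, chosen only to illustrate hoisting a single divide out of a loop and multiplying by the reciprocal inside it.

```c
#include <stddef.h>

/* Hypothetical baseline: one FP divide per element. */
void scale_div(double *out, const double *in, size_t n, double d)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] / d;
}

/* Hypothetical rewrite: a single divide computes the reciprocal up front,
 * and the loop body then uses only FP multiplies. */
void scale_mul(double *out, const double *in, size_t n, double d)
{
    const double inv_d = 1.0 / d;   /* precomputed inverse constant */
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * inv_d;
}
```

The two versions are exactly equal when d is a power of two, but for a general d the result of in[i] * inv_d can differ from in[i] / d in the last bit, because the reciprocal is rounded once and the product is rounded again. That is why compilers such as GCC only do this rewrite automatically under options like -freciprocal-math or -ffast-math.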
Please double-check that these optimizations are safe, since multiplying by a precomputed reciprocal can round differently from a true divide. My SPEC CPU inputs still verify correctly and give the same answers.