JMH Performance Benchmarks: Math vs. StrictMath vs. FastMath (Hipparchus) vs. FastMath (Apache)

Hey, sorry, I thought my changes had been reverted!

I’ll have an MR out to Hipparchus either today or tomorrow :saluting_face:

@afvy the hope is that StrictMath will make the results consistent. So test results might change, but that’s o.k. as long as the differences are small and/or are well understood.

2 Likes

Hey gang, I’m wondering how you’d like me to proceed given the following.

The Plan

For context, the plan is/was to first update FastMath to delegate to StrictMath wherever I changed it to delegate to Math, and to update tests in Hipparchus and Orekit accordingly. From there we can determine whether results are consistent across different platforms.

The next step is to replace some StrictMath calls with the custom implementations that were present before my first MR, in cases where the custom implementations are faster than StrictMath (though slower than Math), e.g., FastMath::sin, cos, and tan.

The Problem

Most test breakages are tolerable, but I’ve hit a (potential) bump with StrictMath::exp:

FastMathTest::testExpSpecialCases is failing, and shows that
StrictMath.exp(1.0) == StrictMath.nextUp(Math.E)
rather than
StrictMath.exp(1.0) == Math.E

The difference in double precision: 4 * machEps == 4 * Precision.EPSILON

The difference between the long bits: 1L

(notably, Math.exp(1.0) == Math.E exactly)

No matter how I look at it, this seems like a problem, esp. when we look at the actual values next to a reference (WolframAlpha’s e):

Math.E           : 2.718281828459045
StrictMath.exp(1): 2.7182818284590455
WolframAlpha e   : 2.7182818284590452353...
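For anyone who wants to reproduce this locally, here's a minimal standalone sketch (class name is mine). Since StrictMath is specified to return bit-identical fdlibm results, it should print the same thing on any compliant JVM:

```java
public class ExpOneUlp {
    public static void main(String[] args) {
        double e  = Math.E;              // the correctly rounded double for e
        double se = StrictMath.exp(1.0); // the fdlibm result

        System.out.println("Math.E            = " + e);
        System.out.println("StrictMath.exp(1) = " + se);

        // Distance in units in the last place, via the raw bit patterns
        long bitDiff = Double.doubleToLongBits(se) - Double.doubleToLongBits(e);
        System.out.println("bit difference    = " + bitDiff);
        System.out.println("se == nextUp(e)?  " + (se == Math.nextUp(e)));
    }
}
```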

StrictMath itself delegates to FdLibm.Exp, which says the following:

     * Special cases:
     *      exp(INF) is INF, exp(NaN) is NaN;
     *      exp(-INF) is 0, and
     *      for finite argument, only exp(0)=1 is exact.
     *
     * Accuracy:
     *      according to an error analysis, the error is always less than
     *      1 ulp (unit in the last place).

So we shouldn’t expect it to be exact but, unless I’m misunderstanding, this is more than one ULP of error, right?
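One thing worth checking: the fdlibm bound is stated against the true value of e^x, not against Math.E. Here's a quick sketch (class and method names are mine) that measures both results against a high-precision reference, using the digits from the WolframAlpha value above:

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class ExpUlpError {
    public static void main(String[] args) {
        // e to ~30 digits (reference value, digits from WolframAlpha)
        BigDecimal eRef = new BigDecimal("2.71828182845904523536028747135266");

        printUlpError("Math.E           ", Math.E, eRef);
        printUlpError("StrictMath.exp(1)", StrictMath.exp(1.0), eRef);
    }

    static void printUlpError(String label, double value, BigDecimal ref) {
        // new BigDecimal(double) is exact: it expands the binary value digit for digit
        BigDecimal err = new BigDecimal(value).subtract(ref).abs();
        BigDecimal ulp = new BigDecimal(Math.ulp(value));
        double errInUlps = err.divide(ulp, MathContext.DECIMAL64).doubleValue();
        System.out.println(label + ": " + errInUlps
                + " ulps (within 1 ulp: " + (errInUlps < 1.0) + ")");
    }
}
```

If I have this right, both values land within 1 ulp of the true e; Math.E is simply the correctly rounded one (under 0.5 ulp), while StrictMath.exp(1) is the next double up.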

Do you (the Orekit and Hipparchus developers) want to go ahead with the initial StrictMath delegation anyway? Should I instead swap to the custom implementation?

Hi Ryan,

thanks for looking into this.
I’m not an expert on these intrinsic functions, but I would say that when in doubt, let’s keep the original Hipparchus FastMath implementation, especially for something as heavily used as the exponential.

@luc what’s your take?

Cheers,
Romain.

2 Likes

Yes, I think that for now we should stay with what we already have.
We will need to come back to this, but for now, I simply don’t have the time.

I wish I had the bandwidth :disappointed_face:

I’ll get an MR out to switch to StrictMath everywhere except exp, for which I’ll reintroduce the custom Hipparchus impl.

@Serrof FYSA

O.k., revert PR here: Revert faster FastMath by RyanMoser · Pull Request #433 · Hipparchus-Math/hipparchus · GitHub

Some day I’ll find some time to dig in and find out why Math has these problems. For now, let’s just get this reverted so the downstream pipelines can pass.

All diverging test results across OS and JDK versions have disappeared, and the corresponding issues have been closed.

So to sum up @Ryan, we’ve managed to delegate some functions to StrictMath, the benefit being less code for us to maintain; as I understand it, there isn’t really any performance gain like there was with Math?

Cheers,
Romain.

1 Like

Hi Romain, not quite, unfortunately.

To put the repos into a passing state again, I simply reverted the commit that switched to Math.

Thus behavior and performance are as they were originally.

In retrospect, I think a much more feasible way to make this switch is to swap over one function at a time.

It’ll take much longer, but this makes it trivial to find the functions that cause irreproducibilities across different platforms.

E.g., if we swapped to Math::sin and nobody on any platform with any Java version sees a problem, then we know it’s very likely a safe switch.

It’s a much more measured approach and, really, the approach I should have taken initially.

2 Likes

Hello, I found this by searching for FastMath, to see where it has ended up. I’m the original author of FastMath, and this discussion of the performance of sin(), cos(), and tan() caught my attention. After digging into it a bit, I think you’ll find that this commit from 2012 is the cause of the excessive object allocations and poor performance.

Here, the code that I had carefully in-lined to perform the Cody Waite reduction was refactored into a helper class. Unfortunately, this causes object allocations to occur inside these trig functions. Object allocation should be avoided in time-critical code such as this.

I assume this was done to make the code “prettier” long ago.

I respectfully suggest that you revert this change and re-run your tests.

5 Likes

Hi @Bill_Rossi I am really happy to see you here!

Thanks a lot for the insight, we will look at it.
The trade-off between performance and maintainability is always tough, but I agree with you here: performance is paramount in this class, and due to its complexity we don’t perform much maintenance on it anyway (and as you pointed out, when we do, we fail).

So +1 to avoiding object allocation.

Thanks a lot for this analysis.

@ryan, do you want to look at it or do you prefer someone else does?

2 Likes

Oh wow, huge! Thank you for this!

@luc I’ll give it a shot!

I’ll see if I can make it static to avoid object instantiation while still avoiding duplication. But if not, reverting to in-line should be pretty safe.

I had tried this back in 2010, and found it was slower. By all means, try it again now. I think the issue was allocating the double[] needed to return the results. I can’t think of a way to avoid the duplication that doesn’t involve a memory allocation.

The comment on the input range is a bit off; it can actually handle inputs up to PI*2^20.

/** Reduce the input argument using the Cody and Waite method.
 *  This is good for inputs 0.0 < x < 1024.0
 *  Output is the remainder after dividing by PI/2.
 *
 *  The result array should contain 3 numbers:
 *  result[0] is the integer portion, so mod 4 this gives the quadrant.
 *  result[1] is the upper bits of the remainder.
 *  result[2] is the lower bits of the remainder.
 */

private static void reduceCodyWaite(double x, double result[])
{
  double remA, remB;
  double a, b;
  int k;

  // Estimate k
  k = (int)(x / 1.5707963267948966);

  // Compute remainder
  while (true)
  {
    a = -k * 1.570796251296997;
    remA = x + a;
    remB = -(remA - x - a);

    a = -k * 7.549789948768648E-8;
    b = remA;
    remA = a + b;
    remB += -(remA - b - a);

    a = -k * 6.123233995736766E-17;
    b = remA;
    remA = a + b;
    remB += -(remA - b - a);

    if (remA > 0.0)
      break;

    // Remainder is negative, so decrement k and try again.
    // This should only happen if the input is very close
    // to an even multiple of pi/2
    k--;
  }

  /* Return the results */
  result[0] = k;
  result[1] = remA;
  result[2] = remB;
}

The call to it looked like this

       double reduceResults[] = new double[3];
       reduceCodyWaite(xa, reduceResults);
       quadrant = ((int) reduceResults[0]) & 3;
       xa = reduceResults[1];
       xb = reduceResults[2];

1 Like

Yeah, looking at it again, I think I’d come to the same conclusion before (about making it static).

So, I reverted that commit (see in my fork here), and I have some prelim. results from a run on a single machine, and it seems results are similar to the non-in-lined version :confused:

If we look at the original, we see a similar ratio between the Math and FastMath scores (~2:1)

I did some very brief profiling, and it looks like we might be suffering from all of the double[] allocations, which is unrelated to the CW changes.

Caveats:

  • these results are for Java 21; in the original post we were still on Java 17
  • I had to update the combined sinCos, too, but that was nbd