G function outliers during parallel TLE propagation

Hi everyone,

I encountered an issue that is happening on the same code part as Exception during findRoot in EventState - #4 by Christophe (associated gitlab issue : non-bracketing issue due to numerical noise for events at end of interval (#921) · Issues · Orekit / Orekit · GitLab).

In my case, I am running a parallel TLE propagation on a radar-carrying spacecraft to compute visibility events on a significant number of targets (hence the parallel computation). The targets are subdivided into N groups associated with N propagators, each target having its own visibility event and PV provider to compute range and angular offset values.

When computing the events, the value returned by the g function in the evaluateStep method (line 217 of EventState) is sometimes way far from the actual value at the very same state. It is very unlikely that this difference is caused by numerical noise because its order of magnitude is usually way higher than numerical errors, but happens at random. As a sign change is found at this step, this wrongly triggers the subsequent call to findRoot line 222. The g function is evaluated again in findRoot to check for a sign change, but no such switch is found as g returns correct values in this part of the code (again, for the exact same state). The solveInterval method logically returns an exception.

This problem does not always occur - runs sometimes go well and return correct results, but other times stop on the exception mentioned above. As it strongly looks like a variable corruption problem, I tried building 2 propagators outside a loop using proper variables for each of them (as recommended in the ParallelPropagator documentation), but the same issue appeared. My guess is that using PV Providers for each target may end up causing outliers on the g function.

Has someone stumbled upon this same issue by any chance, and has a way to fix it ?

Hi Maxime,

It certainly sounds like a race condition. Are you using the PropagatorsParallelizer or running many TLEPropagators in separate threads? Do you have an example I can run on my box to reproduce the issue. It’s hard to debug without test case.

Regards,
Evan

Hi Evan,

Thanks for your reply !

I considered this possibility but as it turns out, I can’t find the corresponding value when I run the code in debug mode and save all the values taken by g throughout the run, until it returns the exception. So I’m not quite sure it’s a matter of concurrent calculations (if that’s what you mean by race condition, I’m far from being an expert on parallel computing).

I’m using a PropagatorParallelizer where I stack up all my TLE propagators, so I guess it’s equivalent to running them in separate threads. I do use the propagate method on the PropagatorParallelizer.

You’ll find a zip archive of the project attached to this answer. The test case I’m running is src\test\java\MultipleTargetsMonitoring\NThreadParallelPropagation.java. It should run without prior configuration but let me know if you’re having trouble running it. I put 20,000 objects but you can definitely try with something like 10,000 objects and 2 propagators, here is the error I get (in French, sorry about that) :

Exception in thread "main" org.orekit.errors.OrekitException: org.orekit.detection.RadarFieldOfViewDetector@bef2d72 n'a pas r�ussi � trouver une solution entre 2023-01-26T03:54:25.28198637720716Z (g=-4,755879168903914E3) et 2023-01-26T03:54:26.2816292686311Z (g=2,249686619105473E-2) La derni�re it�ration �tait � 2023-01-26T03:54:26.2816292686311Z (g=-4,941592013425141E3) at org.orekit.propagation.events.EventState.findRoot(EventState.java:358) at org.orekit.propagation.events.EventState.evaluateStep(EventState.java:222) at org.orekit.propagation.analytical.AbstractAnalyticalPropagator.acceptStep(AbstractAnalyticalPropagator.java:297) at org.orekit.propagation.analytical.AbstractAnalyticalPropagator.propagate(AbstractAnalyticalPropagator.java:151) at org.orekit.propagation.PropagatorsParallelizer$PropagatorMonitoring.lambda$0(PropagatorsParallelizer.java:395) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) at java.base/java.lang.Thread.run(Thread.java:1589) Caused by: org.hipparchus.exception.MathIllegalArgumentException: l'intervalle n'encadre pas une racine : f(0E0) = -4,7558794372866805E3, f(999,6428914239388E-3) = -4,941592013425141E3 at org.hipparchus.analysis.solvers.BracketingNthOrderBrentSolver.doSolveInterval(BracketingNthOrderBrentSolver.java:209) at org.hipparchus.analysis.solvers.BracketingNthOrderBrentSolver.solveInterval(BracketingNthOrderBrentSolver.java:427) at org.hipparchus.analysis.solvers.BracketedUnivariateSolver.solveInterval(BracketedUnivariateSolver.java:126) at org.orekit.propagation.events.EventState.findRoot(EventState.java:350) ... 8 more

Thank you for your help !

Regards,

Maxime

spaceborne-radar.7z (3.9 MB)

I was not able to run it (I’m still using Java 8), but this might be problematic:

// Add field of view detector to SBR TLE propagator with step handler
carrierPropagator.addEventDetector(currentRadarDetector);               
                
// Add event logger for further analyses
carrierPropagator.addEventDetector(logger.monitorDetector(currentRadarDetector));

It would be non-deterministic which of the two event detectors would be triggered first for each event. If there is any noise (e.g. numerical noise) in the computation of the g function then you could get a non-bracketing exception.

It still would be useful to run it to see exactly how it is failing and if anything can be done about it. Packaging the test as a JUnit test case (with supporting resource files) that I can drop into orekit’s src/test/ would be the easiest.

I may have found a workaround but I still can’t explain the random variations I get.

Line 217, the value of g is recomputed at a state corresponding to the end of the considered interval :

// evaluate handler value at the end of the substep
            final AbsoluteDate tb = (i == n - 1) ? t1 : t0.shiftedBy((i + 1) * h);
            final double gb = g(interpolator.getInterpolatedState(tb));

It seems like when the state is retrieved using detector.getInterpolatedState(tb), it is sometimes different than the state obtained when it is computed again in the findRoot method. If I call the interpolation twice in evaluate step, I also find this difference between these two calls. I have been trying to find out why the state returned can vary so much that it completely changes the value of g, but I can’t seem to figure out why.

As you say, the g function is quite noise-sensitive especially for cases where the offset is close to 0. This makes evaluateStep sometimes believe that there is an actual root, which isn’t found by findRoot later on.

I noticed that the problem does not appear for the state at the beginning of the interval. In fact, instead of recomputing g at this state in evaluateStep, its value is retrieved from an attribute of the EventState class (called g0). I guess recomputing the state in this function triggers some random errors, do you know what could cause that kind of behavior ? I tried removing the events logger and running the same test to avoid any concurrent triggers between the event detectors, but I still get this problem.

For now I just implemented an additional check on g values at start and end states in findRoot :

        // Check for fake roots
        final double checkGa = detector.g(interpolator.getInterpolatedState(ta));
        final double checkGb = detector.g(interpolator.getInterpolatedState(tb));
        if (FastMath.signum(checkGa) == FastMath.signum(checkGb)) return false; 

This approach works but it still does not explain the variations in the end state returned in evaluateStep.

I’m fairly new to Orekit and JUnit so I’ll look up how to package this test and will drop it on this thread when it’s done.

Thanks again for your help !

Good find. This will also cause the observed exception. See the JavaDoc for g(...) in [1], particularly:

This method is idempotent, that is calling this multiple times with the same state will result in the same value, with two exceptions. First, the definition of the g function may change when an event occurs on this handler, as in the above example. Second, the definition of the g function may change when the eventOccurred method of any other event handler in the same integrator returns Action.RESET_EVENTS, Action.RESET_DERIVATIVES, or Action.RESET_STATE.

[1] ODEEventHandler (Hipparchus 2.3 API)