Data Context Proposal

Based on community discussion, there is a desire to add the concept of a data context to Orekit that would manage leap seconds, EOP, and everything else that DataProvidersManager is used for. [1-4] My goal here is to consolidate ideas and provide a concrete plan for the community to discuss and provide feedback on. I’ll start implementing the changes over the next few weeks.

Motivation

Adding a data context would enable some new use cases:

  1. Updating leap seconds, EOP, etc. without restarting the JVM. [1]
  2. Comparing multiple EOP data sets within the same JVM.

Updating EOP in a running multi-threaded application is a bit tricky. If the data were updated at an arbitrary point in time, this could create inconsistencies leading to incorrect results. Knowing when it is safe to update the data requires application-level knowledge which the Orekit library does not possess. So Orekit can provide methods to update the data, but the application has the responsibility for calling them at an appropriate time.

Allowing multiple data contexts enables the second use case and provides flexible options for implementing the data update use case. For example, the application could continue to use the old data set for processing jobs (e.g. threads) that were already started to avoid inconsistencies, but use the new, updated data context for new processing jobs. This would allow a high level of concurrency and a gradual switch over to the updated data set.

The existing architecture has a long track record of providing sufficient utility for a variety of use cases and has some advantages compared to managing multiple data contexts:

  • Simple to set up.
  • Consistency throughout application.

Consistency is valuable and eliminates a whole class of bugs. It means that an AbsoluteDate corresponds to a single point in TAI and a single point in UTC. If there are multiple instances of UTCScale this is no longer the case, as each UTCScale could map an AbsoluteDate to a different point in UTC depending on the leap seconds it has loaded. Consistency is also limiting, as it becomes impossible to characterize the differences between data sets, one of the new use cases. My conclusion is that out of the box Orekit should be consistent, but allow the user the power to configure multiple data contexts.

Plan

  1. Create a DataContext that provides access to frames, time scales, and other auxiliary data. DataContext would be initialized with a reference to a DataProvidersManager, which would no longer be a singleton. DataContext would create instances of FramesFactory, TimeScalesFactory, etc., which would no longer be singletons.

  2. Create a default DataContext singleton that matches the existing behavior of Orekit. This provides consistency and a simple setup for simple applications.

  3. Add additional constructors/methods to every piece of existing Orekit code that calls methods in FramesFactory, TimeScalesFactory, etc. Existing methods would use the default DataContext, and the added methods would accept a DataContext or the specific object needed, e.g. UTCScale.

That would comprise the initial capability that would enable the new use cases while maintaining the existing behaviors for users that do not create their own DataContext. A UML diagram of this plan is shown below. This plan is based on the one described in [4].
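
To make item 3 concrete, the dual API could look roughly like the sketch below. The class is just a stand-in and the DataContext accessor names are still to be settled, so treat everything here as illustrative.

public class SomeOrekitClass {

    private final UTCScale utc;

    /** Existing convenience constructor: relies on the default data context. */
    public SomeOrekitClass() {
        this(DataContext.getDefault());
    }

    /** Added constructor: the caller selects the data context explicitly. */
    public SomeOrekitClass(final DataContext context) {
        this.utc = context.getTimeScalesFactory().getUTC();
    }

}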

Improvements

While the proposal above would satisfy the stated use cases there are some more use cases that could be added as future improvements.

Frame transformations between contexts

In the base proposal the GCRF frame would be the same in all data contexts because it is the root frame in Orekit. This isn’t necessarily what the user wants. For example, when comparing OD results that use the same ground station measurements but different EOP, it is more realistic to assume that the Earth fixed frames are the same (the ground stations didn’t move).

This could probably be implemented by creating a FramesFactory constructor that takes a Frame from another data context and a Predefined selecting a frame in this data context. Since the choice of root frame is arbitrary, other frames would then be constructed relative to the selected frame. This would require a significant update to the frame creation code to allow building the tree from either direction, e.g. from GCRF to ITRF or from ITRF to GCRF.
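
One hypothetical shape for such an API, purely for discussion; none of these signatures exist today, and a non-static FramesFactory is itself part of the proposal:

// Use the ITRF already built by another context as the anchor of this
// context's frame tree; every other frame is then built relative to it.
Frame sharedItrf = otherContext.getFramesFactory()
        .getITRF(IERSConventions.IERS_2010, true);
FramesFactory anchoredFactory =
        new FramesFactory(sharedItrf, Predefined.ITRF_CIO_CONV_2010_SIMPLE_EOP);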

Sharing data between contexts

Sometimes users may only want to reuse part of an existing data context when creating a new one. For example, only update the EOP but keep the same leap second file. Under the basic proposal users could do this by setting the same data provider for the leap second file in both data contexts, but multiple copies of the leap second file would then be loaded and stored in memory. A memory optimization could be to create a way for the user to reuse specific data sets from one context in another. This would probably require reusing instances, e.g. UTCScale, or providing methods to get the underlying data table, e.g. UTCScale.getLeapSeconds(), or caching the data in the providers instead of the factories as suggested in [5].
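
For example, instance reuse might look something like this; a DataContext constructor accepting a pre-built UTCScale is an assumption, not part of the minimum plan:

// Reuse the leap second data already loaded by the old context; the new
// DataProvidersManager only has to supply the updated EOP files.
UTCScale sharedUtc = oldContext.getTimeScalesFactory().getUTC();
DataContext newContext = new DataContext(newProvidersManager, sharedUtc);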

Use Java’s Service Provider Interface

As suggested in [5], we could add a method for Orekit to detect data loaders using Java’s ServiceLoader capability. This could replace or augment the addEOPHistoryLoader(), addUTCTAIOffsetsLoader(), addProvider(), and addFilter() families of methods. It could simplify configuration for users reading EOP formats that Orekit does not support natively, e.g. [6].
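
A minimal sketch of the detection side, assuming implementations are declared in META-INF/services/org.orekit.frames.EOPHistoryLoader; how and when Orekit would call this, and which conventions to use, are open questions:

import java.util.ServiceLoader;
import org.orekit.frames.EOPHistoryLoader;
import org.orekit.frames.FramesFactory;
import org.orekit.utils.IERSConventions;

public class EopLoaderDiscovery {
    /** Register every EOPHistoryLoader advertised on the classpath. */
    public static void registerDiscoveredLoaders() {
        for (final EOPHistoryLoader loader : ServiceLoader.load(EOPHistoryLoader.class)) {
            FramesFactory.addEOPHistoryLoader(IERSConventions.IERS_2010, loader);
        }
    }
}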

References

[1] clearFactories() method from test class Utils
[2] https://orekit.org/doc/orekit-day/2019/3%20-%20Quartz%20FDS%20presentation%20for%20Orekit%20Day%20-%20Airbus%20DS.pdf
[3] Thank you for Orekit Day 2019!
[4] https://www.orekit.org/mailing-list-archives/orekit-developers/msg00085.html
[5] https://www.orekit.org/mailing-list-archives/orekit-developers/msg00084.html
[6] http://maia.usno.navy.mil/ser7/mark3.out

Hi @evan.ward,

Your proposal is great!

You seem to have already given a lot of thought to it. I am eager to see what @yannick will say about this as he also has similar needs.

If I understand well, when we want to update data that has already been loaded, we set up a new context and create new objects that will use this new context; we do not update the data in already created objects. Did I understand correctly? This would fit very well in the unit tests too and would allow us to remove the ugly hacks based on the reflection API.

What I don’t understand is how the HeritageDataContext is used. Is it activated automatically under the hood if no other context has been set up, or do users have to enable it somehow? In other words, would existing applications that just set up DataProvidersManager run just as before, or would they require a change? If so, the modification should be considered for 11.0; if not, we could add it earlier.


Looks interesting. I’m not sure I’m following the Case #1 flow if I had existing “legacy” default references.

Yes, that is my plan. That way we can still use immutable objects for synchronization, and reuse much of the existing code as is. Updating a data context in place seemed to me to be more error prone than creating a separate data context.

My plan at this point is to make it compatible with 10.0, and use a separate class, HeritageDataContext, to match any idiosyncrasies of the current implementation. Its singleton will be returned by DataContext.getDefault(). All of the current static methods that are used to configure data loading or retrieve loaded data would be delegated to that class. So I’m targeting 10.1 for the first release of this feature. At this point I would like to leave open the option of merging for 11.0 in case it proves to be too technically challenging to maintain compatibility.

If you’re happy with how it works now then you shouldn’t need to update anything. If you want to take advantage of the new features then you would need to update your code to explicitly handle a DataContext. In this proposal data would not be updated in place. E.g. DataContext.getDefault().getTimeScalesFactory().getUTC() would always have the same set of leap seconds for the life of the application. Using updated data in an application would consist of creating a new DataContext that loads the updated data files, and then using that new data context in your code.
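
Roughly something like this; the constructors and the data directory here are only illustrative, the exact signatures are still to be designed:

// The default context keeps its original leap seconds for the life of the JVM.
UTCScale oldUtc = DataContext.getDefault().getTimeScalesFactory().getUTC();

// A new context pointed at a directory containing the updated data files.
DataProvidersManager manager = new DataProvidersManager();
manager.addProvider(new DirectoryCrawler(new File("/path/to/updated-orekit-data")));
DataContext updated = new DataContext(manager);
UTCScale newUtc = updated.getTimeScalesFactory().getUTC();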

I considered updating data in place, similar to how the unit tests currently work, but decided not to pursue that route for a few reasons:

  • Could no longer use immutability for simple thread safety.
  • Overhead of additional synchronization.
  • It would be difficult to make such an update atomic.
  • It has a global effect, which means the caller would have to “own” the whole application. Plus I would like to move away from using global variables.

That said I would like to hear from the community. If updating data in place is a feature the community wants I can think more about how to implement it.

Regards,
Evan


Hi all,

Thank you @evan.ward for this! There is obviously a lot of thought put into this proposition. I will need some time to understand all the implications of the architecture that you describe, but it certainly seems promising. As @luc mentioned, I am indeed very interested in this feature.

My use case

I would like to work with potentially different data sets on a per-thread basis. I have a server application running, spawning threads as required when computations are requested. Right now, I have to cope with a limitation: all computations must share the same set of data. To change the data I must restart the server.

Another potential approach

To lift this limitation, I have begun exploring a different path, which I will attempt to describe. Please keep in mind that I have little experience with concurrency, and I do not know the internals of Orekit very well, so my proposition should be examined with a skeptical eye. I’ve done some ugly prototyping and it seems promising, but it is still too early to be sure it will work properly in all cases.

ThreadLocal variables

The basic idea would be to replace the static variables holding model data with java.lang.ThreadLocal instances.
These are basically “thread-static” variables: each thread sees its own value, whereas the usual “static” keyword makes a single value shared by the entire process.
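
To illustrate the difference with plain Java (nothing Orekit-specific here):

import java.util.ArrayList;
import java.util.List;

class CacheExample {
    // A regular static field: a single value shared by every thread in the JVM.
    static final List<String> sharedCache = new ArrayList<>();

    // A ThreadLocal: each thread gets, and can clear, its own copy independently.
    static final ThreadLocal<List<String>> perThreadCache =
            ThreadLocal.withInitial(ArrayList::new);
}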

Separate data contexts per thread

So if we replace the static variables with thread-local ones, we should be able to fork a new thread, clear the data (without affecting other threads) and load the desired data in its place. I think this is equivalent to updating the data in place, as mentioned by Evan, but thread-by-thread instead of for the whole application, so it should be less prone to undesired behaviour.

Switching the data context within a thread

This change should also cover the use case where the user wants to perform sequential computations in the same thread, changing the data in-between. Clearing the data and setting up new DataLoaders before the next computation should do the trick.

InheritableThreadLocal

I believe a good optimization would be to use InheritableThreadLocal, so a newly forked thread is initialized with the values from its parent thread. For use cases where many threads are spawned and all use the same data, this architecture should have performance nearly equal to the current version of Orekit. Since this is probably a frequent use case, it seems important to keep it efficient.

This should also work for threads that share some (but not all) data. The common data could be loaded before forking the threads, to improve performance (but maybe not memory usage: depending on how InheritableThreadLocal is implemented, data might be duplicated).

Keeping track of all data

To be able to clear all cached data easily, I think we could centralize references to all instances of thread-local variables as described in the following architecture.


The idea is to have a custom class that inherits from ThreadLocal, with a constructor that automatically registers the instance in a dedicated singleton. Via the singleton, it is then possible to clear all data associated with a thread. This should have roughly the same effect as the clearFactories() method from the test class Utils, while being easier to maintain (the list of instances to clear will be built at runtime, so there should be no need to update this method when the rest of the code changes).
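
A rough sketch of the idea (all class names below are placeholders for discussion):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Central registry of every thread-local data cache. */
final class OrekitDataCacheManager {

    private static final List<ThreadLocal<?>> REGISTERED = new CopyOnWriteArrayList<>();

    private OrekitDataCacheManager() {
    }

    static void register(final ThreadLocal<?> cache) {
        REGISTERED.add(cache);
    }

    /** Clear all cached data, but only for the calling thread. */
    static void clearAllForCurrentThread() {
        REGISTERED.forEach(ThreadLocal::remove);
    }
}

/** ThreadLocal that registers itself on construction so it can be cleared later. */
class OrekitThreadLocal<T> extends ThreadLocal<T> {
    OrekitThreadLocal() {
        OrekitDataCacheManager.register(this);
    }
}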

Since the data will be cleared only for the thread calling the clear method, I do not expect issues with data disappearing while being used. Because the thread is currently calling the clear method, it cannot perform computations at the same time.

Some thoughts

At a glance, I think that this approach is less ambitious than the first one, covering maybe fewer use cases. It introduces a strong coupling between the threads and the data contexts, to the point where there is no real equivalent of the DataContext object from the first proposal. This might be a bad thing. But I also think it would require a bit less work, because we would not need to duplicate the constructors/methods for every class of Orekit that uses model-related data.

However there is still a significant amount of work to do: ThreadLocal is a wrapper class, not a keyword like “static”. So the refactoring is quite significant: instead of just using a static variable, we have to get() the value of the thread-local variable everywhere it is used. And it will be hard to ensure that no static variable used for a data context has been forgotten.

On the bright side, I believe we can hide this change behind the public API to make it backward-compatible with current Orekit-based applications. If the user does not change his code, all data will simply be stored automatically in the context associated with the main thread.

Thanks again Evan for this initiative. When the plan is finalized, if you want to share the load of developing this feature, I’m willing to help however I can.

Hi @yannick,

Thanks for presenting a different approach. I have a similar use case where I’m trying to get an HTTP server to use different auxiliary data depending on the request.

Similarities

In many cases both approaches provide equivalent functionality. As you mentioned one could switch data contexts at will using OrekitDataCacheManager.clearAllForCurrentThread(), which provides similar capability to explicitly specifying the DataContext for a particular computation. Also an application that uses a global ThreadLocal<DataContext> would behave similarly to the design you propose. The InheritableThreadLocal optimization provides similar functionality to what I described in “Sharing data between contexts”. I mention these similarities to show that both proposals address a majority of the use cases and the differences will be on the margin.

Differences

As you mentioned the biggest difference is whether cached data is tied directly to the current thread.

I try to avoid ThreadLocal because “Each thread holds an implicit reference to its copy of a thread-local variable as long as the thread is alive and the ThreadLocal instance is accessible”.[1] In other words a static ThreadLocal behaves as a field declared in Thread. This can cause two problems when using a thread pool such as an ExecutorService or the Jetty HTTP server.[2] First is that when a task starts it is never sure what the values of the ThreadLocals are, so each task would always have to start by calling clearAllForCurrentThread(). Second is that the memory used by a ThreadLocal is not garbage collected even after the application is done with it, potentially causing a memory leak. A workaround would be that each task is wrapped in try {...} finally {OrekitDataCacheManager.clearAllForCurrentThread()} to ensure that the memory can be garbage collected. All the loading and unloading of data takes a non-trivial amount of time. My impression is that ThreadLocal is convenient when each thread is used for a single operation, but causes headaches when threads are reused via pools or for asynchronous I/O. Googling “ThreadLocal memory leak” shows that most Java application servers can have issues with ThreadLocal.
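
For example, each pooled task would need something along these lines (using the placeholder clearAllForCurrentThread() name from the proposal above):

// inside every task submitted to the thread pool:
try {
    // the actual computation, using whatever data this thread has loaded
} finally {
    // without this, the pool thread keeps the previous task's data alive
    OrekitDataCacheManager.clearAllForCurrentThread();
}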

Another difference is that if the data context is tied to a thread then calculations involving multiple data contexts become hard. For example, if one wanted to compute the transform from ITRF based on rapid data to ITRF based on final data, then two data sets would need to be loaded in the same thread at the same time for the transform to be calculated. If a data context were tied to a thread, one would have to perform half the calculation, switch to a different thread, perform the other half of the calculation, and then combine the results.
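
With separate DataContext objects the whole calculation stays in one thread. A hypothetical sketch, assuming two contexts already loaded with rapid and final EOP respectively and the getFramesFactory() accessor from the proposal:

// Each context has loaded a different EOP data set.
Frame itrfRapid = rapidContext.getFramesFactory().getITRF(IERSConventions.IERS_2010, true);
Frame itrfFinal = finalContext.getFramesFactory().getITRF(IERSConventions.IERS_2010, true);

// Offset between the two Earth fixed realizations at a given date.
Transform rapidToFinal = itrfRapid.getTransformTo(itrfFinal, date);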

The other big difference is whether data is updated in existing objects. For example, does a given instance of UTCScale always refer to the same UTC time scale (with the same leap seconds), or does a UTCScale instance represent different realizations of UTC (different numbers of leap seconds) at different points in the application depending on the current thread and point in time?

Thoughts

I agree with @yannick’s conclusion that using ThreadLocal would be easier to implement while maintaining backwards compatibility. I prefer having separate DataContext objects that are not tied to a Thread for the reasons outlined above.

[1] https://docs.oracle.com/javase/7/docs/api/java/lang/ThreadLocal.html
[2] https://www.eclipse.org/jetty/documentation/current/architecture.html

I agree with your reasoning here. I consider the live update a bit of an edge case which is already addressable by simply restarting the application/service. Even without legacy concerns there is a usability concern which I think your proposal addresses. The case you mentioned of needing to have multiple leap second tables or other data sets loaded concurrently is an interesting one I hadn’t thought of, and this proposal addresses it without introducing usability issues for mainstream usage.

Thank you for this explanation. I think you may be right!

However I am still a bit concerned with the “dual API” approach. In a complex application that wants to make use of this new DataContext feature, it seems a bit too easy to forget to add the extra DataContext argument on some API calls. This would result in inconsistent results (due to the computations without the extra argument being performed using the default context) that could be extremely hard to spot.

I think it may be wise to provide an easy way to deactivate the default DataContext. If the application deactivates the default context, API calls without the extra DataContext argument could, for instance, throw an exception. This would allow errors to be spotted at runtime (ideally, when unit-testing the application) instead of computing with inconsistent data. Clearing all data providers from the default context could maybe achieve this, although it may be better to have something more explicit than that.

Another idea could be to have an API that allows a thread to activate a DataContext, which would then be used for all computations (for this thread only) without having to pass the DataContext argument for all method calls. It would basically replace the default context for this thread. This could ensure consistency while also freeing the user from the burden of carrying the DataContext around his entire application. However, if badly used, I have a feeling that this could also lead to weird bugs…

+1 That’s a good idea for finding hard-to-detect bugs. Perhaps we could even allow the user to set the default data context, so they can use a different implementation if they would like.

I think we could enable this functionality by allowing the user to set the default data context. For example, if the following object was set as the default data context then it would allow using the static API to retrieve different data depending on the current thread. I’m -0 for adding this to the library because as you mention it has the potential to create some weird bugs, but one of the nice things about DataContext being an interface is that users can create their own implementations for their specific application.

class ThreadLocalDataContext implements DataContext {
    private final ThreadLocal<DataContext> delegate = new ThreadLocal<>();
    // delegate all methods to delegate.get()
}

Yes, that is certainly what many people will end up doing anyway, so I’m fine with having the code stub that you wrote inside my application instead of inside Orekit.

I created #607 to track implementation of this proposal. I’ll get started and push to a new branch when I have a minimum implementation. At that point I would appreciate the community testing and reviewing it to provide feedback early in the development cycle.