Data Context Proposal

Based on community discussion there is a desire to add the concept of a data context to Orekit that would manage leap seconds, EOP, and everything else that DataProvidersManager is used for. [1-4] My goal here is to consolidate ideas and provide a concrete plan for the community to discuss and provide feedback on. I’ll start implementing the changes over the next few weeks.

Motivation

Adding a data context would enable some new use cases:

  1. Updating leap seconds, EOP, etc. without restarting the JVM. [1]
  2. Comparing multiple EOP data sets within the same JVM.

Updating EOP in a running multi-threaded application is a bit tricky. If the data were updated at an arbitrary point in time this could create inconsistencies leading to incorrect results. Knowing when it is safe to update the data requires application-level knowledge which the Orekit library does not possess. So Orekit can provide methods to update the data, but the application has the responsibility of calling them at an appropriate time.

Allowing multiple data contexts enables the second use case and provides flexible options for implementing the data update use case. For example, the application could continue to use the old data set for processing jobs (e.g. threads) that were already started to avoid inconsistencies, but use the new, updated data context for new processing jobs. This would allow a high level of concurrency and a gradual switch over to the updated data set.

The existing architecture has a long track record of providing sufficient utility for a variety of use cases and has some advantages compared to managing multiple data contexts:

  • Simple to set up.
  • Consistency throughout application.

Consistency is valuable and eliminates a whole class of bugs. It means that an AbsoluteDate corresponds to a single point in TAI and a single point in UTC. If there are multiple instances of UTCScale this is no longer the case, as each UTCScale could map an AbsoluteDate to a different point in UTC depending on the leap seconds it has loaded. Consistency is also limiting, as it makes it impossible to characterize the differences between data sets, which is one of the new use cases. My conclusion is that out of the box Orekit should be consistent, but allow the user the power to configure multiple data contexts.

Plan

  1. Create a DataContext that provides access to frames, time scales, and other auxiliary data. DataContext would be initialized with a reference to a DataProvidersManager, which would no longer be a singleton. DataContext would create instances of FramesFactory, TimeScalesFactory, etc., which would no longer be singletons.

  2. Create a default DataContext singleton that matches the existing behavior of Orekit. This provides consistency and a simple setup for simple applications.

  3. Add additional constructors/methods to every piece of existing Orekit code that calls methods in FramesFactory, TimeScalesFactory, etc. Existing methods would use the default DataContext, and the added methods would accept a DataContext or the specific object needed, e.g. UTCScale.

That would comprise the initial capability that would enable the new use cases while maintaining the existing behaviors for users that do not create their own DataContext. This plan as a UML diagram is shown below. This plan is based on the one described in [4].
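As a rough sketch of items 1 and 2, the default context could be a settable singleton behind a small interface. Everything here is illustrative, not the final Orekit API; the interface is reduced to a single accessor to keep the sketch runnable:

```java
// Hypothetical stand-in for Orekit's DataContext interface.
interface DataContext {
    String describe();
}

final class DefaultContextHolder {
    // Heritage behavior: a process-wide default, but now replaceable.
    private static volatile DataContext defaultContext =
            () -> "heritage context (lazy-loaded data)";

    static DataContext getDefault() {
        return defaultContext;
    }

    static void setDefault(DataContext context) {
        defaultContext = context;
    }
}

public class DefaultContextDemo {
    public static void main(String[] args) {
        // Out of the box, everything shares one consistent context.
        System.out.println(DefaultContextHolder.getDefault().describe());
        // An application may swap in a context with updated data.
        DefaultContextHolder.setDefault(() -> "updated EOP context");
        System.out.println(DefaultContextHolder.getDefault().describe());
    }
}
```

The volatile field makes the swap visible to all threads, while individual contexts stay immutable.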

Improvements

While the proposal above would satisfy the stated use cases there are some more use cases that could be added as future improvements.

Frame transformations between contexts

In the base proposal the GCRF frame would be the same in all data contexts because it is the root frame in Orekit. This isn’t necessarily what the user wants. For example, when comparing OD with measurements from ground stations and different EOP it is more realistic to assume that the Earth fixed frames are the same (ground stations didn’t move).

This could probably be implemented by creating a FramesFactory constructor that takes a Frame from another data context and a Predefined selecting a frame in this data context. Since the choice of root frame is arbitrary, other frames would then be constructed relative to the selected frame. This would require a significant update to the frame creation code to allow building the tree from either direction, e.g. from GCRF to ITRF or from ITRF to GCRF.

Sharing data between contexts

Sometimes users may only want to reuse part of an existing data context when creating a new one. For example, only update the EOP but use the same leap second file. Under the basic proposal users could do this by setting the same data provider for the leap second file in both data contexts, but multiple copies of the leap second file will then be loaded and stored in memory. A memory optimization could be to create a way for the user to reuse specific data sets from one context to another. This would probably require reusing instances, e.g. UTCScale, or providing methods to get the underlying data table, e.g. UTCScale.getLeapSeconds(), or caching the data in the providers instead of the factories as suggested in [5].

Use Java’s Service Provider Interface

As suggested in [5] we could add a method for Orekit to detect data loaders using Java’s ServiceLoader capability. This could replace or augment the addEOPHistoryLoader(), addUTCTAIOffsetsLoader(), addProvider(), and addFilter() families of methods. It could also simplify configuration for users when reading EOP formats that Orekit does not support natively, e.g. [6].
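For illustration, here is how java.util.ServiceLoader discovers implementations. AuxiliaryDataLoader is a hypothetical interface standing in for an Orekit loader type; since no provider file is registered under META-INF/services here, the loader finds nothing:

```java
import java.util.ServiceLoader;

// Hypothetical loader interface; in Orekit this might correspond to an
// EOP or UTC-TAI offsets loader abstraction.
interface AuxiliaryDataLoader {
    String name();
}

public class ServiceLoaderDemo {
    public static void main(String[] args) {
        // ServiceLoader discovers implementations whose class names are
        // listed in META-INF/services/AuxiliaryDataLoader on the classpath.
        ServiceLoader<AuxiliaryDataLoader> loaders =
                ServiceLoader.load(AuxiliaryDataLoader.class);
        int count = 0;
        for (AuxiliaryDataLoader loader : loaders) {
            System.out.println("found loader: " + loader.name());
            count++;
        }
        // No provider configuration file is present, so nothing is found.
        System.out.println("loaders found: " + count);
    }
}
```

A third-party jar could then contribute a custom EOP format parser just by shipping the provider file, with no registration call in application code.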

References

[1] clearFactories() method from test class Utils
[2] https://orekit.org/doc/orekit-day/2019/3%20-%20Quartz%20FDS%20presentation%20for%20Orekit%20Day%20-%20Airbus%20DS.pdf
[3] Thank you for Orekit Day 2019! - #5 by evan.ward
[4] Re: [Orekit Developers] [Orekit Users] design of the DataProviders
[5] RE: [Orekit Developers] [Orekit Users] design of the DataProviders
[6] http://maia.usno.navy.mil/ser7/mark3.out

Hi @evan.ward,

Your proposal is great!

You seem to have already given a lot of thought to it. I am eager to see what @yannick will say about this as he also has similar needs.

If I understand well, when we want to update data that has already been loaded, we set up a new context and create new objects that will use this new context; we do not update the data in already created objects. Did I understand correctly? This would fit very well in the unit tests too and would allow us to remove the ugly hacks based on the reflection API.

What I don’t understand is how the HeritageDataContext is used. Is it activated automatically under the hood if no other context has been set up, or do users have to enable it somehow? In other words, would existing applications that just set up DataProvidersManager run just as before, or would they require a change? If so, the modification should be considered for 11.0; if not, we could add it earlier.

Looks interesting. I’m not sure I’m following the Case #1 flow if I had existing “legacy” default references.

Yes, that is my plan. That way we can still use immutable objects for synchronization, and reuse much of the existing code as is. Updating a data context in place seemed to me to be more error prone than creating a separate data context.

My plan at this point is to make it compatible with 10.0, and use a separate class, HeritageDataContext, to match any idiosyncrasies of the current implementation. Its singleton will be returned by DataContext.getDefault(). All of the current static methods that are used to configure data loading or retrieve loaded data would be delegated to that class. So I’m targeting 10.1 for the first release of this feature. At this point I would like to leave open the option of merging for 11.0 in case it proves to be too technically challenging to maintain compatibility.

If you’re happy with how it works now then you shouldn’t need to update anything. If you want to take advantage of the new features then you would need to update your code to explicitly handle a DataContext. In this proposal data would not be updated in place. E.g. DataContext.getDefault().getTimeScalesFactory().getUTC() would always have the same set of leap seconds for the life of the application. Using updated data in an application would consist of creating a new DataContext that loads the updated data files, and then using that new data context in your code.

I considered updating data in place, similar to how the unit tests currently work, but decided not to pursue that route for a few reasons:

  • Could no longer use immutability for simple thread safety.
  • Overhead of additional synchronization.
  • It would be difficult to make such an update atomic.
  • Has a global effect, which means the call would have to “own” the whole application. Plus I would like to move away from using global variables.

That said I would like to hear from the community. If updating data in place is a feature the community wants I can think more about how to implement it.

Regards,
Evan

Hi all,

Thank you @evan.ward for this! There is obviously a lot of thought put into this proposition. I will need some time to understand all the implications of the architecture that you describe, but it certainly seems promising. As @luc mentioned, I am indeed very interested by this feature.

My use case

I would like to work with potentially different data sets on a per-thread basis. I have a server application running, spawning threads as required when computations are requested. Right now, I have to cope with a limitation: all computations must share the same set of data. To change the data I must restart the server.

Another potential approach

To lift this limitation, I have begun exploring a different path, which I will attempt to describe. Please keep in mind that I have little experience with concurrency, and I do not know the internals of Orekit very well, so my proposition should be examined with a skeptical eye. I’ve done some ugly prototyping and it seems promising, but it is still too early to be sure it will work properly in all cases.

ThreadLocal variables

The basic idea would be to replace model data static variables by java.lang.ThreadLocal instances.
These are basically “thread-static” variables that have a single value for a given thread, whereas the usual “static” keyword enforces the consistency of the value for the entire process.
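To illustrate the “thread-static” behavior described above with plain JDK classes (the string value stands in for whatever loaded data the variable would hold):

```java
public class ThreadLocalDemo {
    // Each thread sees its own copy; analogous to replacing a static
    // field holding loaded data with a per-thread slot.
    private static final ThreadLocal<String> DATA =
            ThreadLocal.withInitial(() -> "default-data");

    public static void main(String[] args) throws InterruptedException {
        DATA.set("main-data");
        Thread worker = new Thread(() -> {
            // The worker starts from the initial value, not the main
            // thread's value, because plain ThreadLocal is not inherited.
            System.out.println("worker sees: " + DATA.get());
            DATA.set("worker-data");
        });
        worker.start();
        worker.join();
        // The worker's set() did not affect the main thread's copy.
        System.out.println("main sees: " + DATA.get());
    }
}
```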

Separate data contexts per thread

So if we replace the static variables by thread-local ones, we should be able to fork a new thread, clear the data (without affecting other threads) and load the desired data in its place. I think this is equivalent to updating the data in place, as mentioned by Evan, but thread-by-thread instead of for the whole application, so it should be less prone to undesired behaviour.

Switching the data context within a thread

This change should also cover the use case where the user wants to perform sequential computations in the same thread, changing the data in-between. Clearing the data and setting up new DataLoaders before the next computation should do the trick.

InheritableThreadLocal

I believe a good optimization would be to use InheritableThreadLocal, so a newly forked thread is initialized with the values from its parent thread. For use cases where many threads are spawned and will all use the same data, this architecture should have performance nearly equal to the current version of Orekit. Since this is probably a frequent use case, it seems important to keep it efficient.

This should also work for threads that share some (but not all) data. The common data could be loaded before forking the threads, to improve performance (but maybe not memory usage: depending on how InheritableThreadLocal is implemented, data might be duplicated).
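A small demonstration of the inheritance behavior: the child thread receives the parent's value at creation time instead of starting from the initial value.

```java
public class InheritableDemo {
    // Child threads start from the parent's value instead of re-loading.
    private static final InheritableThreadLocal<String> DATA =
            new InheritableThreadLocal<String>() {
                @Override
                protected String initialValue() {
                    return "empty";
                }
            };

    public static void main(String[] args) throws InterruptedException {
        DATA.set("parent EOP table");
        Thread child = new Thread(() ->
                // The value was copied when the child thread was created.
                System.out.println("child sees: " + DATA.get()));
        child.start();
        child.join();
        System.out.println("parent sees: " + DATA.get());
    }
}
```

Note that the copy happens once, at thread construction; later changes in the parent are not propagated to running children.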

Keeping track of all data

To be able to clear all cached data easily, I think we could centralize references to all instances of thread-local variables as described in the following architecture.


The idea is to have a custom class that inherits from ThreadLocal, with a constructor that automatically registers the instance in a dedicated singleton. Via the singleton, it is then possible to clear all data associated with a thread. This should have roughly the same effect as the clearFactories() method from the test class Utils, while being easier to maintain (the list of instances to clear will be built at runtime, so there should be no need to update this method when the rest of the code changes).
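A minimal sketch of such a self-registering thread-local. The names (OrekitDataCacheManager, TrackedThreadLocal) are taken from this discussion and are hypothetical, not existing Orekit classes:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical singleton holding references to every tracked thread-local.
class OrekitDataCacheManager {
    private static final List<ThreadLocal<?>> REGISTRY =
            new CopyOnWriteArrayList<>();

    static void register(ThreadLocal<?> variable) {
        REGISTRY.add(variable);
    }

    /** Reset every registered thread-local for the calling thread only. */
    static void clearAllForCurrentThread() {
        for (ThreadLocal<?> variable : REGISTRY) {
            variable.remove(); // next get() falls back to the initial value
        }
    }
}

// ThreadLocal subclass that registers itself on construction, so the
// registry is built at runtime as caches are created.
class TrackedThreadLocal<T> extends ThreadLocal<T> {
    TrackedThreadLocal() {
        OrekitDataCacheManager.register(this);
    }
}

public class RegistryDemo {
    public static void main(String[] args) {
        TrackedThreadLocal<String> leapSeconds = new TrackedThreadLocal<>();
        leapSeconds.set("2017 leap second table");
        OrekitDataCacheManager.clearAllForCurrentThread();
        // After clearing, get() returns the default initial value (null).
        System.out.println("after clear: " + leapSeconds.get());
    }
}
```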

Since the data will be cleared only for the thread calling the clear method, I do not expect issues with data disappearing while being used: the thread that is calling the clear method cannot be performing computations at the same time.

Some thoughts

At a glance, I think that this approach would be less ambitious than the first one, covering maybe fewer use cases. It introduces a strong coupling between the threads and the data contexts, to the point where there is no real equivalent of the DataContext object from the first proposal. This might be a bad thing. But I also think it would require a bit less work, because we would not need to duplicate the constructors/methods for every class of Orekit that uses model-related data.

However there is still a significant amount of work to do: ThreadLocal is a wrapper class, not a keyword like “static”. So the refactoring is quite significant: instead of just using a static variable, we have to get() the value of the thread-local variable everywhere it is used. And it will be hard to ensure that no static variable used for a data context has been forgotten.

On the bright side, I believe we can hide this change behind the public API to make it backward-compatible with current Orekit-based applications. If the user does not change his code, all data will simply be stored automatically in the context associated with the main thread.

Thanks again Evan for this initiative. When the plan is finalized, if you want to share the load of developing this feature, I’m willing to help however I can.

Hi @yannick,

Thanks for presenting a different approach. I have a similar use case where I’m trying to get a HTTP server to use different auxiliary data depending on the request.

Similarities

In many cases both approaches provide equivalent functionality. As you mentioned one could switch data contexts at will using OrekitDataCacheManager.clearAllForCurrentThread(), which provides similar capability to explicitly specifying the DataContext for a particular computation. Also an application that uses a global ThreadLocal<DataContext> would behave similarly to the design you propose. The InheritableThreadLocal optimization provides similar functionality to what I described in “Sharing data between contexts”. I mention these similarities to show that both proposals address a majority of the use cases and the differences will be on the margin.

Differences

As you mentioned the biggest difference is whether cached data is tied directly to the current thread.

I try to avoid ThreadLocal because “Each thread holds an implicit reference to its copy of a thread-local variable as long as the thread is alive and the ThreadLocal instance is accessible”.[1] In other words a static ThreadLocal behaves as a field declared in Thread. This can cause two problems when using a thread pool such as an ExecutorService or the Jetty HTTP server.[2] First is that when a task starts it is never sure what the values of the ThreadLocals are, so each task would always have to start by calling clearAllForCurrentThread(). Second is that the memory used by a ThreadLocal is not garbage collected even after the application is done with it, potentially causing a memory leak. A workaround would be that each task is wrapped in try {...} finally {OrekitDataCacheManager.clearAllForCurrentThread()} to ensure that the memory can be garbage collected. All the loading and unloading of data takes a non-trivial amount of time. My impression is that ThreadLocal is convenient when each thread is used for a single operation, but causes headaches when threads are reused via pools or for asynchronous I/O. Googling “ThreadLocal memory leak” shows that most Java application servers can have issues with ThreadLocal.
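Both problems can be shown with a plain ExecutorService: the second task submitted to a single-thread pool observes the first task's leftover ThreadLocal value unless the slot is cleared, which is the try/finally workaround mentioned above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolStaleValueDemo {
    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    public static void main(String[] args) throws Exception {
        // One pooled thread executes both tasks in submission order.
        ExecutorService pool = Executors.newSingleThreadExecutor();

        pool.submit(() -> CONTEXT.set("task-1 data")).get();

        // The second task runs on the same pooled thread and observes
        // the leftover value, because the pool reuses the thread.
        String leftover = pool.submit(() -> {
            try {
                return CONTEXT.get();
            } finally {
                CONTEXT.remove(); // the try/finally workaround
            }
        }).get();

        System.out.println("second task saw: " + leftover);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Without the remove() call the value, and anything reachable from it, stays pinned to the pooled thread for the thread's lifetime, which is the memory-leak scenario described above.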

Another difference is that if the data context is tied to a thread then calculations involving multiple data contexts become hard. For example, if one wanted to compute the transform from ITRF based on rapid data to ITRF based on final data then two data sets would need to be loaded in the same thread at the same time for the transform to be calculated. If a data context was tied to a thread one would have to perform half the calculation, switch to a different thread, perform the other half of the calculation, and then combine the results.

The other big difference is whether data is updated in existing objects. For example, does a given instance of UTCScale always refer to the same UTC time scale (with the same leap seconds), or does a UTCScale instance represent different realizations of UTC (different numbers of leap seconds) at different points in the application depending on the current thread and point in time?

Thoughts

I agree with @yannick’s conclusion that using ThreadLocal would be easier to implement while maintaining backwards compatibility. I prefer having separate DataContext objects that are not tied to a Thread for the reasons outlined above.

[1] https://docs.oracle.com/javase/7/docs/api/java/lang/ThreadLocal.html
[2] https://www.eclipse.org/jetty/documentation/current/architecture.html

I agree with your reasoning here. I consider the live update a bit of an edge case which is already addressable by simply restarting the application/service. Even without legacy concerns there is a usability concern which I think your proposal addresses. The case you mentioned which is needing to have multiple leap seconds or other data loaded concurrently is an interesting case I hadn’t thought of and this proposal addresses that without introducing usability issues for mainstream usage.

Thank you for this explanation. I think you may be right!

However I am still a bit concerned with the “dual API” approach. In a complex application that wants to make use of this new DataContext feature, it seems a bit too easy to forget adding the extra DataContext argument on some API calls. This would result in inconsistent results (due to the computations without the extra argument being performed using the default context) that could be extremely hard to spot.

I think it may be wise to provide an easy way to deactivate the default DataContext. If the application deactivates the default context, API calls without the extra DataContext argument could, for instance, throw an exception. This would allow spotting errors at runtime (ideally, when unit-testing the application) instead of computing with inconsistent data. Clearing all data providers from the default context could maybe achieve this, although it may be better to have something more explicit than that.

Another idea could be to have an API that allows a thread to activate a DataContext, which would then be used for all computations (for this thread only) without having to pass the DataContext argument for all method calls. It would basically replace the default context for this thread. This could ensure consistency while also freeing the user from the burden of carrying the DataContext around his entire application. However, if badly used, I have a feeling that this could also lead to weird bugs…

+1 That’s a good idea for finding hard to detect bugs. Perhaps even allowing the user to set the default data context, so they can use a different implementation if they would like.

I think we could enable this functionality by allowing the user to set the default data context. For example, if the following object was set as the default data context then it would allow using the static API to retrieve different data depending on the current thread. I’m -0 for adding this to the library because as you mention it has the potential to create some weird bugs, but one of the nice things about DataContext being an interface is that users can create their own implementations for their specific application.

class ThreadLocalDataContext implements DataContext {
    ThreadLocal<DataContext> delegate;
    // delegate all methods to delegate.get()
}

Yes, that is certainly what many people will end up doing anyway, so I’m fine with having the code stub that you wrote inside my application instead of inside Orekit.

I created #607 to track implementation of this proposal. I’ll get started and push to a new branch when I have a minimum implementation. At that point I would appreciate the community testing and reviewing it to provide feedback early in the development cycle.

Hi All,

I’ve made some progress in implementing the changes. It is larger than I originally thought and I’ve encountered a few issues that I think warrant some discussion. At this point I’m still planning to have the changes done for 10.1, though that is starting to seem like a stretch goal.

Serialization

Do we serialize auxiliary data? The previous behavior was to drop the auxiliary data and use the *Factory.get methods to re-initialize. This meant it was the user’s responsibility to make sure the serializing end and the deserializing end had the same auxiliary data; otherwise the transmitted objects would have different values/meanings. We could keep this behavior, but it means that serialization loses any information about the data context, as it does now.

The other option is to serialize the objects as they are actually used, including the auxiliary data context. It means more data going over the wire, but the ability to exactly reconstruct what was sent.

I don’t use Java’s serialization for security reasons. For the people that do use serialization, what would you prefer?

Circular references

There are a few places where the existing factories use each other to do their work. In the new framework where many instances of each factory can be created this creates circular references in the dependency tree. A couple examples:

FramesFactory needs CelestialBodiesFactory to implement getIcrf(), but CelestialBodiesFactory needs FramesFactory to create the frame for getEarth().

Options:

  1. Delete the circular reference: Remove getIcrf() from FramesFactory
  2. Create an EarthFramesFactory on which both FramesFactory and
    CelestialBodiesFactory depend. I.e. make the class hierarchy match the
    contents of the data files.
  3. Add framesFactory.setCelestialBodies(CelestialBodyFactory). I.e. keep the
    circular reference.

EOP loaders need UTC to parse dates but TimeScalesFactory needs EOP for UT1.
Basically the same options as above.

My preference would be for the second option since we plan to maintain backward compatibility.

Hi Evan, nice to see that you have made progress in this big endeavor.

I used Java serialization a few years ago. It was mainly for short-term storage inside a single application (basically as a means of saving intermediate results to use them as input for another computation). In this context, partial serialization was fine for me because I could ensure that the data context would remain unchanged.

However, I have not used Java serialization enough to have a real opinion on the matter.

Regarding circular references, I agree that solution 2 seems the most promising. Having the class hierarchy move closer to the data’s ‘natural’ organization is usually a good sign. And removing the circular reference will probably save a few headaches for the users. But are you sure this would not break backward compatibility?

If serialization is a problem, just drop it.
We have already removed it in several places; we can continue this move as needed.
The initial choice to drop auxiliary data and rebuild it (which typically appears when serializing frames) was to avoid excessive data being stored. As an example, if you serialize a state that references an attitude, which references an Earth ellipsoid, which references an Earth frame, you end up serializing the full EOP, and your state, which you expected to contain a hundred bytes or so, ends up at a few megabytes. It was considered acceptable to just save the Earth frame by saving its name, not the underlying EOP.

+1 for the 2nd option for dealing with circular references.

Initial Push

I just pushed up my work to the orekit repository. The data-context branch contains the modifications necessary so that every piece of code in Orekit has the option of not using the default data context. I believe that I have been able to do it in a backward compatible way and all tests pass. There were many ways to make the changes so I would appreciate your feedback on API design and interfaces that changed.

In particular I was able to avoid part of the circular reference problem by passing a reference to TimeScales to LazyLoadedEop.getEOPHistory(). The downside is that attempting to use timeScales.getUT1(...) before getEOPHistory() returns will cause a stack overflow.

Expect a few more smaller changes. Still on my TODO list:

  • review use of getUT1(EOPHistory) vs. getUT1(IERSConventions, boolean).
  • add empty data context that throws exceptions when used
  • add implementations of TimeScales, Frames, CelestialBodies, and GravityFields that use a specific, provided data set instead of loading it using DataProvidersManager. This will require creating interfaces for EOP, leap second parsers and making them public.
  • more unit tests
  • update package-info.java documentation
  • create a tutorial
  • update architecture documentation
  • checkstyle and other code clean up.
  • fix #627
  • thread local data context?
  • delete serialization code?

I’ve started on the first three already. Help is appreciated on the others.

Looking forward to 11.0 we should have a discussion of what to deprecate and what to leave, but first I think we should focus on getting 10.1 ready.

Annotation + Compiler Plugin

I also added another branch named annotation where I created @DefaultDataContext to document where the default data context is used. I also created a compiler plugin to emit warnings similar to @Deprecated. Even though I was careful to track the places that use the default data context this plugin found three that I missed. It could be useful for other developers who don’t want to use the default data context. Should we include it?
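For illustration, a source-level marker annotation like @DefaultDataContext could be declared as follows. This is a sketch; the actual definition on the annotation branch may differ in retention policy and targets:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marks code that relies on the default data context, so a compiler
// plugin can warn callers, analogous to @Deprecated warnings.
@Documented
@Retention(RetentionPolicy.CLASS)
@Target({ElementType.METHOD, ElementType.CONSTRUCTOR,
         ElementType.FIELD, ElementType.TYPE})
@interface DefaultDataContext {
}

public class AnnotationDemo {
    @DefaultDataContext
    static String usesDefaultContext() {
        // A method like this would implicitly read global state.
        return "uses DataContext.getDefault()";
    }

    public static void main(String[] args) {
        System.out.println(usesDefaultContext());
    }
}
```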

To try it out, run: cd annotation && mvn install && cd ../plugin && mvn install && cd .. && mvn compile

It needs some work before it is ready for production and we would need to figure out how to hook it into the build system.

Great!

I’ll look at it ASAP.

Awesome! That’s a huge amount of work you’ve done here, thank you very much.
I’ll give it a try soon.

I have pushed a few cosmetic changes (checkstyle warnings, use of deprecated methods, imports…).

Up to now, the API seems good to me. I will try to update the documentation, this will help me delve deeper into this API and some use cases.

Thanks Luc, Yannick. I made a few changes to remove getUT1(EOPHistory) from the TimeScales interface, since it is documented as for expert users only, and to add implementations of TimeScales and Frames that don’t use DataProvidersManager. I decided not to add an empty data context since many classes (e.g. AbsoluteDate, Propagator) in Orekit need a functional default data context for class initialization to complete, even if those static variables are never used later.

I am still working on the documentation.
@evan.ward, do you intend to implement the PreloadedDataContext that appears in the diagram I found in the design folder? I have simplified this diagram and was wondering if this implementation should still appear or not.

I think it would be interesting to have this implementation. It would be helpful in tests, and could be interesting in small scale applications that embed everything.