Data Context Proposal

yannick · September 20, 2019, 3:28pm

Hi all,

Thank you @evan.ward for this ! There is obviously a lot of thought put into this proposition. I will need some time to understand all the implications of the architecture that you describe, but it certainly seems promising. As @luc mentionned, I am indeed very interested by this feature.

My use case

I would like to work with potentially different data sets on a per-thread basis. I have a server application running, spawning threads as required when computations are requested. Right now, I have to cope up with a limitation: all computations must share the same set of data. To change the data I must restart the server.

Another potential approach

To lift this limitation, I have begun exploring a different path, which I will attempt to describe. Please keep in mind that I have little experience with concurrency, and I do not know the internals of Orekit very well, so my proposition should be examined with a skeptical eye. I’ve done some ugly prototyping and it seems promising, but it is still too early to be sure it will work properly in all cases.

ThreadLocal variables

The basic idea would be to replace model data static variables by java.lang.ThreadLocal instances.
These are basically “thread-static” variables that have a single value for a given thread, whereas the usual “static” keyword enforces the consistency of the value for the entire process.

Separate data contexts per thread

So if we replace the static variables by thread-local ones, we should be able to fork a new thread, clear the data (without affecting other threads) and load the desired data in its place. I think this is equivalent to updating the data in place, as mentionned by Evan, but thread-by-thread instead of the whole application, so should be less prone to undesired behaviour.

Switching the data context within a thread

This change should also cover the use case where the user wants to perform sequential computations in the same thread, changing the data in-between. Clearing the data and setting up new DataLoaders before the next computation should do the trick.

InheritableThreadLocal

I believe a good optimization would be to use InheritableThreadLocal, so a newly forked thread is initialized with the values from its parent thread. For use cases where many threads are spawned, and will all use the same data, this architecture should have performances near equal to the current version of Orekit. Since this is probably a frequent use case, it seems important to keep it efficient.

This should also work for threads that share some (but not all) data. The common data could be loaded before forking the threads, to improve performance (but maybe not memory usage: depending on how InheritableThreadLocal is implemented, data might be duplicated).

Keeping track of all data

To be able to clear all cached data easily, I think we could centralize references to all instances of thread-local variables as described in the following architecture.

The idea is to have a custom class that inherits from ThreadLocal, with a constructor that automatically registers the instance in a dedicated singleton. Via the singleton, it is then possible to clear all data associated with a thread. This should have roughly the same effect as the clearFactories() method from the test class Utils, while being easier to maintain (the list of instances to clear will be built at runtime, so there should be no need to update this method when the rest of the code changes).

Since the data will be cleared only for the thread calling the clear method, I do not expect issues with data disappearing while being used. Because the thread is currently calling the clear method, it cannot perform computations at the same time.

Some thoughts

At a glance, I think that this approach would be less ambitious than the first one, covering maybe fewer use cases. It introduces a strong coupling between the threads and the data contexts, to the point where there is no real equivalent of the DataContext object from the first proposal. This might be a bad thing. But I also think it would require a bit less work, because we would not need to duplicate the contructors/methods for every class of Orekit that uses model-related data.

However there is still a significant amount of work to do: ThreadLocal is a wrapper class, not a keyword like “static”. So the refactoring is quite significant: instead of just using a static variable, we have to get() the value of the thread-local variable everywhere it is used. And it will be hard to ensure that no static variable used for a data context has been forgotten.

On the bright side, I believe we can hide this change behind the public API to make it backward-compatible with current Orekit-based applications. If the user does not change his code, all data will simply be stored automatically in the context associated with the main thread.

Thanks again Evan for this initiative. When the plan is finalized, if you want to share the load of developing this feature, I’m willing to help however I can.