New framework for CCSDS messages

Hi all,

As some of you may know, we are rewriting from the ground up the framework for handling CCSDS messages (both parsing and writing). This work started because a new version of ODM (version 3) will be published, and it includes a new message type: OCM, which stands for Orbit Comprehensive Message.
Orekit will be one of the two independent implementations of the standard (the other one being STK)
that are required by CCSDS before a proposed version becomes an official recommendation.

This feature corresponds to issue #474, "Add support for version 3 of CCSDS Orbit Data Messages", in the Orekit GitLab repository.

As we started working on this, it appeared that the framework we used up to Orekit 10.3 would not scale, and
to be honest was rather clumsy. We also wanted to generalize writing of messages (currently we can write only AEM and OEM), and we wanted to parse both KVN and XML versions for all message types (currently we parse XML only for TDM messages). So we needed to overhaul the framework.

This work is far from complete, but it is in a state that can be presented, and more importantly discussed among the Orekit community. So I have started this topic to present you the current state of the work,
point you to the code, and ask you for reviews and your opinions about the design.

The new framework introduces some incompatibilities for users.

The main one is that once a CCSDS message has been parsed, users will need to go through a hierarchical tree of objects to access primitive data. For example, getting the originator requires writing something like file.getHeader().getOriginator(). There are some convenience methods for specific cases (like generating a SpacecraftState directly from an OPM), but only in a few cases. If you would like to have more convenience methods like this, please ask for them.

Another incompatibility is that instead of using several signatures to create the parser, we use a single one with a DataSource (which is a new name for the old NamedData class), and which has several parameters. I think replacing random uses of File/file name/InputStream with DataSource would improve consistency, as well as allowing filtering.

A third incompatibility is in the write part for the generic interfaces (i.e. the EphemerisFile part), as some CCSDS files may be created from generic ephemerides. The implementation we used up to 10.3 was to put everything CCSDS needs in the generic interface so the CCSDS writer is happy, and to put a lot of metadata in a simple hashtable. This ended up with some ephemerides declaring several methods that just return null when they have no equivalent to CCSDS, and write methods failing at run time when some metadata was not in the hashtable. The interface was really modelled after CCSDS; it was not really generic. So we stripped down the generic interface to a bare minimum, without the CCSDS specificities. Then, when really writing an OEM or OCM ephemeris, we need to provide these required data. Our solution is to have the information specific to one format provided when the format writer is created, and to have the raw ephemeris data written later.

The current state of the work is available on the gitlab main repository on branch issue-474. It is huge
and uses several design patterns and Java specificities (visitor pattern, state pattern, method references). Please take a look at it and tell us what you think.

I will write a follow-up message in this topic, copying the current state of the documentation with a few UML diagrams. This documentation is already in the source tree, so you can generate it by yourself
using mvn site; the following message will just be a copy-paste, probably with some editing to take care of markdown syntax differences between site generation and what the forum uses.


The org.orekit.files.ccsds package provides classes to handle parsing
and writing CCSDS messages.

Users point of view


The package is organized in hierarchical sub-packages that reflect the sections
hierarchy from CCSDS messages, plus some utility sub-packages. The following class
diagram depicts this static organization.

The org.orekit.files.ccsds.section sub-package defines the generic sections
found in all CCSDS messages: Header, Metadata and Data. All extend the
Orekit-specific Section interface that is used for checks at the end of parsing.
Metadata and Data are gathered together in a Segment structure.

The org.orekit.files.ccsds.ndm sub-package defines a single top-level abstract
class NDMFile, which stands for Navigation Data Message. All CCSDS messages extend
this top-level abstract class. NDMFile is a container for one Header and one or
more Segment objects, depending on the file type (for example OPMFile only contains
one segment whereas OEMFile may contain several segments).

There are as many sub-packages as there are CCSDS message types, with
intermediate sub-packages for each officially published recommendation:
org.orekit.files.ccsds.ndm.adm.apm, org.orekit.files.ccsds.ndm.adm.aem,
org.orekit.files.ccsds.ndm.odm.opm, org.orekit.files.ccsds.ndm.odm.oem,
org.orekit.files.ccsds.ndm.odm.omm, org.orekit.files.ccsds.ndm.odm.ocm,
and org.orekit.files.ccsds.ndm.tdm. Each contains the logical structures
that correspond to the message type, among which at least one ##MFile
class that represents a complete message/file. As some data are common to
several types, there may be some intermediate classes in order to avoid
code duplication. These classes are implementation details and not displayed
in the previous class diagram. If the message type has logical blocks (like state
vector block, Keplerian elements block, maneuvers block in OPM), then
there is one dedicated class for each logical block.

The top-level file also contains some Orekit-specific data that are mandatory
for building some objects but are not present in the CCSDS messages. This
includes for example IERS conventions, data context, and gravitational
coefficient for ODM files as it is sometimes optional in these messages.

This organization has been introduced with Orekit 11.0. Before that, the CCSDS
hierarchy with header, segment, metadata and data was not reproduced in the API
but a flat structure was used.

This organization implies that users wishing to access raw internal entries must
walk through the hierarchy. For message types that allow only one segment, there
are shortcuts to use file.getMetadata() and file.getData() instead of
file.getSegments().get(0).getMetadata() and file.getSegments().get(0).getData()
respectively. Where it is relevant, other shortcuts are provided to access
Orekit-compatible objects as shown in the following code snippet:

OPMFile         opm       = ...;
AbsoluteDate    fileDate  = opm.getHeader().getCreationDate();
Vector3D        dV        = opm.getManeuver(0).getdV();
SpacecraftState state     = opm.generateSpacecraftState();
// getting orbit date the hard way:
AbsoluteDate    orbitDate = opm.getSegments().get(0).getData().getStateVectorBlock().getEpoch();
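
To make the walk through the hierarchy more concrete, here is a minimal self-contained sketch of a header/segments container with single-segment shortcuts. The names HierarchySketch, Header, Segment and MessageFile, as well as the String-valued metadata and data, are illustrative assumptions and not the Orekit classes; only the shape of the accessors mirrors what is described above.

```java
import java.util.List;

// Hypothetical stand-ins for the header/segment/file hierarchy; these are NOT the Orekit classes.
class HierarchySketch {

    static final class Header {
        private final String originator;
        Header(String originator) { this.originator = originator; }
        String getOriginator() { return originator; }
    }

    static final class Segment<M, D> {
        private final M metadata;
        private final D data;
        Segment(M metadata, D data) { this.metadata = metadata; this.data = data; }
        M getMetadata() { return metadata; }
        D getData() { return data; }
    }

    static final class MessageFile<M, D> {
        private final Header header;
        private final List<Segment<M, D>> segments;

        MessageFile(Header header, List<Segment<M, D>> segments) {
            this.header = header;
            this.segments = List.copyOf(segments);
        }

        Header getHeader() { return header; }
        List<Segment<M, D>> getSegments() { return segments; }

        // Shortcuts, meaningful only for message types limited to one segment
        M getMetadata() { return onlySegment().getMetadata(); }
        D getData() { return onlySegment().getData(); }

        private Segment<M, D> onlySegment() {
            if (segments.size() != 1) {
                throw new IllegalStateException("shortcut requires exactly one segment");
            }
            return segments.get(0);
        }
    }

    public static void main(String[] args) {
        Segment<String, String> segment = new Segment<>("metadata", "data");
        MessageFile<String, String> file =
            new MessageFile<>(new Header("EXAMPLE"), List.of(segment));
        // Both paths reach the same object
        System.out.println(file.getData() == file.getSegments().get(0).getData());
    }
}
```

In this sketch the shortcuts fail loudly on multi-segment files; whether the real classes offer the shortcuts only on single-segment message types is a design detail to check in the branch.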

Message files can be obtained by parsing an existing file or by using
the setters to create it from scratch, bottom up starting from the
raw elements and building up through logical blocks, data, metadata,
segments, header and finally file.


Parsing a text message to build some kind of NDMFile object is performed
by setting up a parser. Each message type has its own parser. Once created,
its parseMessage method is called with a data source. It will return the
parsed file as a hierarchical container as depicted in the previous section.

The Orekit-specific data that are mandatory for building some objects but are
not present in the CCSDS messages are set up when building the parser. This
includes for example IERS conventions, data context, and gravitational
coefficient for ODM files as it is sometimes optional in these messages.
One change introduced in Orekit 11.0 is that the progressive set up of
parser using the fluent API (methods withXxx()) has been removed. Now the
few required parameters are all set at once in the constructor. Another change
is that the parsers are mutable objects that gather the data during the parsing.
They can therefore not be used in a multi-threaded environment. The recommended way
to use parsers is to either dedicate one parser for each message and drop it
afterwards, or to use a single-thread loop.

Parsers automatically recognize if the file is in Key-Value Notation (KVN) or in
eXtended Markup Language (XML) format and adapt accordingly. This is
transparent for users and works with all CCSDS message types. The data to
be parsed is provided using a DataSource object, which combines a name
and a stream opener and can be built directly from these elements, from a file name,
or from a File instance. The DataSource object delays
the real opening of the file until the parseMessage method is called and
takes care to close it properly after parsing, even if parsing is interrupted
due to some parse error.
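
The lazy opening described above can be sketched as follows. This is not the Orekit DataSource class: LazySourceSketch, Source, Opener and countBytes are hypothetical names, and countBytes merely stands in for a parser. The sketch only shows a name paired with an opener, with the stream opened at use time and closed by try-with-resources even if reading fails.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a named, lazily opened data source; NOT the Orekit DataSource class.
class LazySourceSketch {

    @FunctionalInterface
    interface Opener {
        InputStream open() throws IOException;
    }

    static final class Source {
        private final String name;
        private final Opener opener;
        Source(String name, Opener opener) { this.name = name; this.opener = opener; }
        String getName() { return name; }
        Opener getOpener() { return opener; }
    }

    // Stand-in for a parser: the stream is opened only here,
    // and try-with-resources closes it even if reading throws.
    static int countBytes(Source source) {
        try (InputStream is = source.getOpener().open()) {
            int n = 0;
            while (is.read() >= 0) {
                n++;
            }
            return n;
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    public static void main(String[] args) {
        int[] openCount = { 0 };
        Source source = new Source("example.oem", () -> {
            openCount[0]++;
            return new ByteArrayInputStream("CCSDS_OEM_VERS = 2.0\n".getBytes(StandardCharsets.UTF_8));
        });
        System.out.println("opened before parsing: " + openCount[0]);
        System.out.println("bytes read: " + countBytes(source));
        System.out.println("opened after parsing: " + openCount[0]);
    }
}
```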

The OEMParser and OCMParser have an additional feature: they also implement
the generic EphemerisFileParser interface, so they can be used in a more
general way when ephemerides can be read from various formats (CCSDS, CPF, SP3).
The EphemerisFileParser interface defines a parse(dataSource) method that
is similar to the CCSDS-specific parseMessage(dataSource) method.

As the parsers are parameterized with the type of the parsed file, the parseMessage
and parse methods in all parsers already have the specific type, there is no need
to cast the returned value.

The following code snippet shows how to parse an OEM file, in this case using a
file name to create the data source:

OEMParser  parser = new OEMParser(conventions, simpleEOP, dataContext,
                                  missionReferenceDate, mu, defaultInterpolationDegree);
OEMFile    oem    = parser.parseMessage(new DataSource(fileName));


Writing a CCSDS message is done by using a specific writer class for the message
type and using a low level generator corresponding to the desired file format,
KVNGenerator for Key-Value Notation or XMLGenerator for eXtended Markup Language.

Ephemeris-type messages (AEM, OEM and OCM) implement the generic ephemeris writer
interfaces (AttitudeEphemerisFileWriter and EphemerisFileWriter) in addition
to the CCSDS-specific API, so they can be used in a more general way when ephemerides
data was built from non-CCSDS data. The generic write methods in these interfaces
take as arguments objects that implement the generic
AttitudeEphemerisFile.AttitudeEphemerisSegment and EphemerisFile.EphemerisSegment
interfaces. As these interfaces do not provide access to the header and metadata information
that CCSDS writers need, this information must be provided beforehand to the
writers. This is done by providing directly the header and a metadata template in
the constructor of the writer. Of course, non-CCSDS writers would use different
strategies to get their specific metadata. The metadata provided is only a template that
is incomplete: the frame, start time and stop time will be filled later on when
the data to be written is available, as they will change for each segment. The
argument used as the template is not modified when building a writer, its content
is copied in an internal object that is modified by adding the proper frame and
time data when each segment is created.
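
A minimal sketch of this "header and metadata template in the constructor" idea is given below. All names (TemplateWriterSketch, Writer, Metadata, Segment, segment) are hypothetical and the output is a crude KVN-like text, not what the Orekit writers produce; the point is only that the caller's template is copied and the copy is completed for each segment.

```java
import java.util.List;

// Hypothetical sketch; NOT the Orekit writer classes.
class TemplateWriterSketch {

    // Generic segment: knows nothing about format-specific metadata
    interface Segment {
        String getFrameName();
        String getStart();
        String getStop();
    }

    static Segment segment(String frame, String start, String stop) {
        return new Segment() {
            public String getFrameName() { return frame; }
            public String getStart() { return start; }
            public String getStop() { return stop; }
        };
    }

    static final class Metadata {
        String objectName;   // format-specific, known when the writer is built
        String frameName;    // the next three fields depend on each segment
        String start;
        String stop;

        Metadata copy() {
            Metadata c = new Metadata();
            c.objectName = objectName;
            c.frameName = frameName;
            c.start = start;
            c.stop = stop;
            return c;
        }
    }

    static final class Writer {
        private final String header;
        private final Metadata template;

        Writer(String header, Metadata template) {
            this.header = header;
            this.template = template.copy();   // the caller's object is never modified
        }

        String write(List<? extends Segment> segments) {
            StringBuilder out = new StringBuilder(header).append('\n');
            for (Segment s : segments) {
                Metadata m = template.copy();  // complete a fresh copy for each segment
                m.frameName = s.getFrameName();
                m.start = s.getStart();
                m.stop = s.getStop();
                out.append("OBJECT_NAME = ").append(m.objectName).append('\n')
                   .append("REF_FRAME = ").append(m.frameName).append('\n')
                   .append("START_TIME = ").append(m.start).append('\n')
                   .append("STOP_TIME = ").append(m.stop).append('\n');
            }
            return out.toString();
        }
    }
}
```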

Ephemeris-type messages can also be used in a streaming way (with specific
Streaming##MWriter classes) if the ephemeris data must be written as it is produced
on-the-fly by a propagator. These specific writers provide a newSegment() method that
returns a fixed step handler to register to the propagator. If ephemerides must be split
into different segments, in order to prevent interpolation between two time ranges
separated by a discrete event like a maneuver, then a new step handler must be retrieved
using the newSegment() method at discrete event time and a new propagator must be used.
All segments will be gathered properly in the generated CCSDS file. Using the same
propagator and same event handler would not work as expected. The propagator would run
just fine through the discrete event that would reset the state, but the ephemeris would
not be aware of the change and would just continue the same segment. Upon reading the
file produced this way, the reader would not be aware that interpolation should not be
used around this maneuver as the event would not appear in the file.
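
The role of newSegment() can be illustrated with a toy model in which a "step handler" is just a consumer of already formatted ephemeris lines. StreamingSketch and StreamingWriter are hypothetical names and no propagator is involved; this is only a sketch of why a fresh handler is needed at each discrete event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical toy model of a streaming writer; NOT the Orekit Streaming##MWriter classes.
class StreamingSketch {

    static final class StreamingWriter {
        private final List<List<String>> segments = new ArrayList<>();

        // Each call opens a new segment and returns the handler that feeds it
        Consumer<String> newSegment() {
            List<String> segment = new ArrayList<>();
            segments.add(segment);
            return segment::add;
        }

        List<List<String>> getSegments() { return segments; }
    }

    public static void main(String[] args) {
        StreamingWriter writer = new StreamingWriter();

        Consumer<String> beforeManeuver = writer.newSegment();
        beforeManeuver.accept("state at t0");
        beforeManeuver.accept("state at t1");

        // discrete event: ask for a new handler so the data lands in a separate segment
        Consumer<String> afterManeuver = writer.newSegment();
        afterManeuver.accept("state at t1 (after maneuver)");

        System.out.println(writer.getSegments().size() + " segments");
    }
}
```

Reusing beforeManeuver after the event would append everything to the first segment, which is the silent failure mode described above.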

TODO: describe CCSDS-specific API

Developers point of view

This section describes the design of the CCSDS framework. It is an implementation
detail and is useful only for Orekit developers or people wishing to extend it,
perhaps by adding support for new message types. It is not required to simply
parse or write CCSDS messages.


The first level of parsing is lexical analysis. Its aim is to read the
stream of characters from the data source and to generate a stream of
ParseToken. Two different lexical analyzers are provided: KVNLexicalAnalyzer
for Key-Value Notation and XMLLexicalAnalyzer for eXtended Markup Language.
The LexicalAnalyzerSelector utility class selects one or the other of these lexical
analyzers depending on the first few bytes read from the data source. If the
start of the XML declaration ("<?xml …>") which is mandatory in all XML documents
is found, then XMLLexicalAnalyzer is selected, otherwise KVNLexicalAnalyzer
is selected. Detection works for UCS-4, UTF-16 and UTF-8 encodings, with or
without a Byte Order Mark, and regardless of endianness. After the first few bytes
allowing selection have been read, the characters stream is reset to beginning so
the selected lexical analyzer will see these characters again. Once the lexical
analyzer has been created, the message parser registers itself to this analyzer by calling
its accept method, and waits for the lexical analyzer to call it back for processing
the tokens it will generate from the characters stream. This is akin to the visitor
design pattern with the parser visiting the tokens as they are produced by the lexical
analyzer.
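
The "peek at the first bytes, then rewind" step can be sketched with the standard mark/reset mechanism of BufferedInputStream. This is a simplified assumption-laden version: FormatSniffSketch, sniff and sniffThenRead are hypothetical names, and only UTF-8 without a Byte Order Mark is handled, whereas the selector described above also covers UCS-4 and UTF-16.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical simplified selector; NOT the Orekit LexicalAnalyzerSelector.
class FormatSniffSketch {

    enum Format { KVN, XML }

    private static final byte[] XML_START = "<?xml".getBytes(StandardCharsets.US_ASCII);

    // Reads the first few bytes, then rewinds so the selected analyzer sees them again
    static Format sniff(BufferedInputStream in) throws IOException {
        in.mark(XML_START.length);
        byte[] first = new byte[XML_START.length];
        int n = in.readNBytes(first, 0, first.length);
        in.reset();
        return n == first.length && Arrays.equals(first, XML_START) ? Format.XML : Format.KVN;
    }

    // Demonstration helper: shows that the full content is still readable after sniffing
    static String sniffThenRead(byte[] content) {
        try (BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(content))) {
            Format format = sniff(in);
            String all = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            return format + ":" + all;
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    public static void main(String[] args) {
        System.out.println(sniffThenRead("CCSDS_OEM_VERS = 2.0".getBytes(StandardCharsets.UTF_8)));
    }
}
```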

The following class diagram presents the static structure of lexical analysis:

The dynamic view of lexical analysis is depicted in the following sequence diagram:

The second level of parsing is syntax analysis. Its aim is
to read the stream of ParseToken objects and to progressively build the CCSDS message
from them. Syntax analysis of primitive entries like EPOCH_TZERO = 1998-12-18T14:28:15.1172
in KVN or <EPOCH_TZERO>1998-12-18T14:28:15.1172</EPOCH_TZERO> in XML is independent
of the file format: in both cases the lexical analyzer will generate a ParseToken with type set
to TokenType.ENTRY, name set to EPOCH_TZERO and content set to 1998-12-18T14:28:15.1172.
This token will be passed to the message parser for processing and the parser may ignore
whether the token was extracted from a KVN or an XML file. This greatly simplifies parsing of both
formats and avoids code duplication. This is unfortunately not true for higher level structures
like header, segments, metadata, data or logical blocks. For all these cases, the parser must
know if the file is in Key-Value Notation or in eXtended Markup Language, so the lexical
analyzer starts parsing by calling the parser reset method with the file format as an
argument, so the parser knows how to handle the higher level structures.
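
The format independence of primitive entries can be illustrated with a toy tokenizer. TokenSketch, Token, fromKvn and fromXml are hypothetical names, and the regular expressions are a deliberate simplification: the XML lexical analyzer described in this thread relies on a SAX parser, not on regular expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy tokenizer; NOT the Orekit ParseToken or lexical analyzers.
class TokenSketch {

    static final class Token {
        private final String name;
        private final String content;
        Token(String name, String content) { this.name = name; this.content = content; }
        String getName() { return name; }
        String getContent() { return content; }
        @Override public String toString() { return "ENTRY " + name + " -> " + content; }
    }

    private static final Pattern KVN = Pattern.compile("^\\s*([A-Z_0-9]+)\\s*=\\s*(.*?)\\s*$");
    private static final Pattern XML = Pattern.compile("^\\s*<([A-Z_0-9]+)>(.*?)</\\1>\\s*$");

    static Token fromKvn(String line) {
        return match(KVN, line);
    }

    static Token fromXml(String element) {
        return match(XML, element);
    }

    private static Token match(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a primitive entry: " + text);
        }
        return new Token(m.group(1), m.group(2));
    }

    public static void main(String[] args) {
        // Both lines describe the same token, so downstream code need not know the format
        System.out.println(fromKvn("EPOCH_TZERO = 1998-12-18T14:28:15.1172"));
        System.out.println(fromXml("<EPOCH_TZERO>1998-12-18T14:28:15.1172</EPOCH_TZERO>"));
    }
}
```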

CCSDS messages are complex, with a lot of sub-structures and we want to parse several types
(APM, AEM, OPM, OEM, OMM, OCM and TDM as of version 11.0). There are hundreds of keys to
manage (i.e. a lot of different names a ParseToken can have). Prior to version 11.0, Orekit
used a single big enumerate class for all these keys, but it proved unmanageable as more
message types were supported. The framework set up with version 11.0 is based on the fact
these numerous keys belong to a smaller set of logical blocks that are always parsed as a
whole (header, metadata, state vector, covariance…). Parsing can be performed with the
parser switching between a small number of well-known states. When one state is active,
say metadata parsing, then lookup is limited to the keys allowed in metadata. If an
unknown token arrives, then the parser assumes we have finished the current section, and
it switches into another state, say data parsing, that is the fallback to use after
metadata. This is an implementation of the State design pattern. Parsers always have
one current ProcessingState that remains active as long as it can process the tokens
provided to it by the lexical analyzer, and they have a fallback ProcessingState to
switch to when a token could not be handled by the current one. The following class
diagram shows this design:

All parsers set up the initial processing state when their reset method is called
by the lexical analyzer at the beginning of the message, and they manage the fallback
processing state by anticipating what the next state could be when one state is
activated. This is highly specific for each message type, and unfortunately also
depends on file format (KVN vs. XML). As an example, in KVN files, the initial
processing state is already the HeaderProcessingState, but in XML files it is
rather XMLStructureProcessingState and HeaderProcessingState is triggered only
when the XML <header> start element is processed. CCSDS message types are also not
very consistent, which makes implementation more complex. As an example, APM files
don’t have META_START, META_STOP, DATA_START or DATA_STOP keys in the
KVN version, whereas AEM files have both, and OEM files have META_START, META_STOP
but have neither DATA_START nor DATA_STOP. All parsers extend the AbstractMessageParser
abstract class, which declares several hooks (prepareHeader, inHeader,
finalizeHeader, prepareMetadata…) which can be called by various states
so the parser knows where it is and prepares the fallback processing state. The
prepareMetadata hook for example is called by KVNStructureProcessingState
when it sees a META_START key, and by XMLStructureProcessingState when it
sees a metadata start element. The parser then knows that metadata parsing is
going to start and sets up the fallback state for it.

When the parser is not switching states, one state is active and processes all
upcoming tokens one after the other. Each processing state may adopt a different
strategy for this, depending on the section it handles. Processing states are
always quite small. Some processing states that can be reused from message type
to message type (like HeaderProcessingState, KVNStructureProcessingState or
XMLStructureProcessingState) are implemented as classes. Other processing
states that are specific to one message type (and hence to one parser), are
implemented as a single private method within the parser and method references
are used to point directly to the method. This allows one parser class to
provide simultaneously several implementations of the ProcessingState interface.
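
The current/fallback mechanism and the use of method references can be sketched in a few lines. The sketch below is an assumption-based toy: StateSketch, Parser, accept and reject are hypothetical, tokens are reduced to a key and a value, and the set of accepted keys is hard-coded. Only the ProcessingState idea (a state returns false when it cannot handle a token) follows the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy parser showing the State pattern with method references; NOT Orekit code.
class StateSketch {

    @FunctionalInterface
    interface ProcessingState {
        // returns false when the token does not belong to this state
        boolean processToken(String key, String value);
    }

    static final class Parser {
        private final List<String> log = new ArrayList<>();
        private ProcessingState current  = this::processHeaderToken;
        private ProcessingState fallback = this::processMetadataToken;

        void accept(String key, String value) {
            while (!current.processToken(key, value)) {
                current  = fallback;       // current section is finished, move on
                fallback = this::reject;   // until the new state announces its own fallback
            }
        }

        List<String> getLog() { return log; }

        private boolean processHeaderToken(String key, String value) {
            if (!key.equals("CREATION_DATE") && !key.equals("ORIGINATOR")) {
                return false;
            }
            log.add("header:" + key);
            return true;
        }

        private boolean processMetadataToken(String key, String value) {
            if (!key.equals("OBJECT_NAME") && !key.equals("REF_FRAME")) {
                return false;
            }
            fallback = this::processDataToken;   // anticipate what comes after metadata
            log.add("metadata:" + key);
            return true;
        }

        private boolean processDataToken(String key, String value) {
            if (!key.equals("EPOCH")) {
                return false;
            }
            log.add("data:" + key);
            return true;
        }

        private boolean reject(String key, String value) {
            throw new IllegalArgumentException("unexpected key " + key);
        }
    }
}
```

One class thus provides four implementations of the same functional interface without any extra class declarations, which is the benefit attributed to method references above.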

In many cases, the keys that are allowed in a section are fixed so they are defined
in an enumerate. The processing state (in this case often a private method within
the parser) then simply selects the enum constant using the standard valueOf method
from the enumerate class and delegates token processing to it. The enum constant
then just calls one of the processAs methods from the token, pointing it to the
metadata/data/logical block setter to call for storing the token content. For
sections that both reuse some keys from a more general section and add their
own keys, several enumerate types can be checked in a row. A typical example of this
design is the processMetadataToken method in OEMParser, which is a single
private method acting as a ProcessingState and tries the enumerates MetadataKey,
ODMMetadataKey, OCommonMetadataKey and finally OEMMetadataKey to fill up
the metadata section.
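
Here is a hedged sketch of the "enum constant per key, several enums tried in a row" approach. KeySketch, CommonKey, OemKey and the String-only Metadata container are hypothetical, and the real enum constants go through the token's processAs methods rather than a BiConsumer. The sketch relies only on the standard behaviour of valueOf, which throws IllegalArgumentException for an unknown name.

```java
import java.util.function.BiConsumer;

// Hypothetical sketch of key enums delegating to setters; NOT the Orekit key enumerates.
class KeySketch {

    static final class Metadata {
        String objectName;
        String refFrame;
        String interpolation;
    }

    // Keys shared by several message types
    enum CommonKey {
        OBJECT_NAME((value, metadata) -> metadata.objectName = value),
        REF_FRAME((value, metadata) -> metadata.refFrame = value);

        private final BiConsumer<String, Metadata> setter;
        CommonKey(BiConsumer<String, Metadata> setter) { this.setter = setter; }
        void process(String value, Metadata metadata) { setter.accept(value, metadata); }
    }

    // Keys specific to one message type
    enum OemKey {
        INTERPOLATION((value, metadata) -> metadata.interpolation = value);

        private final BiConsumer<String, Metadata> setter;
        OemKey(BiConsumer<String, Metadata> setter) { this.setter = setter; }
        void process(String value, Metadata metadata) { setter.accept(value, metadata); }
    }

    // Acts as one processing state: try the general keys first, then the specific ones
    static boolean processMetadataToken(String key, String value, Metadata metadata) {
        try {
            CommonKey.valueOf(key).process(value, metadata);
            return true;
        } catch (IllegalArgumentException notCommon) {
            try {
                OemKey.valueOf(key).process(value, metadata);
                return true;
            } catch (IllegalArgumentException notOem) {
                return false;   // let the parser switch to its fallback state
            }
        }
    }
}
```

With this layout, supporting a new key amounts to one new field in the container and one new enum constant; the processing method itself does not change.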

Adding a new message type (let’s name it XYZ message) involves creating the XYZFile
class that extends NDMFile, creating the XYZData container for the data part,
and creating one or more XYZSection1Key, XYZSection2Key… enumerates, one for each
logical block allowed in the message format. The final task is to create
the XYZParser and set up the state switching logic, using existing classes for
the global structure and header, and private methods processSection1Token,
processSection2Token… for processing the tokens from each logical block.

Adding a new key to an existing message when a new version of the message format
is published by CCSDS generally consists in adding one field in the data container
with a setter and a getter, and one enum constant that will be recognized by
the existing processing state and that will call one of the processAs methods from
the token, asking it to call the new setter.


The following class diagram presents the implementation of writing:

TODO explain diagram

Congratulations for this incredible work!!
That’s a very good improvement. Even if it introduces important changes and incompatibilities for users that will update their Orekit from 10.3 to 11.0, it was necessary.

For me, the incompatibilities introduced are not problems. For instance, a hierarchical tree can be very useful to improve the understanding of the file structure, especially with CCSDS where the structure is sometimes complicated. Refactoring of the generic interfaces was also necessary and it is a very good improvement.

I don’t have many questions or remarks because we already discussed these changes. I just have some questions:

  • Is it normal that you removed the AttitudeEphemerisFileParser interface?
  • What is the meaning of the “O” letter in OCommonMetadata class?
  • What are the next steps? Finish to implement the OCM format? Add Writer classes for all formats (APM, OMM, OPM, etc.)?
  • For the future OMMWriter, do you think it could be a good functionality to have the possibility to write an OMM file from a TLE object?


There was just one implementation, so I removed it. We could put it back if you want.

Orbit, like the rest of the formats. It could be removed or replaced by something
more explicit. It is common to some of the O*M messages, but not all.

I think I will first focus on OCM, by initially adding the missing logical blocks, then adding parsing of
XML version, then adding write. This will allow to start exchanging messages with the team in
charge of the other reference implementation.

Then, while still exchanging OCM files, I will add XML parsing for all formats and finish by adding writers for all formats.

We can also wait to see if @dgondelach can contribute support for the latest TDM version with some new keywords.

A final step would be to add support for Conjunction Data Message.


Wow big overhaul. I think that is great to make the code more consistent and add support for the new formats. I like the new structure and separating out lexical vs semantic parsing.

Do you have a link to the new CCSDS standard? I looked for it briefly but I couldn’t find it.

Can Orekit still read and write version 2 messages? If Orekit is only the second software to support version 3 it seems like we will still need to support version 2 for some time.

Not a big deal but now to use them in a concurrent environment users have to create XxxParserBuilder classes that hold the constructor parameters for the parser. The parser could be made thread safe by splitting out a separate class for the parse state, but probably not worth reworking the code yet again. Given the number of constructor parameters users will probably want to make builder classes anyways for convenience.

Technically the declaration is optional in both XML 1.0 and XML 1.1 as indicated by the word “should”. See section 4.3.1 of Extensible Markup Language (XML) 1.1 (Second Edition) and Extensible Markup Language (XML) 1.0 (Fifth Edition) specs. It is a bit confusing though because section 2.8 of the 1.1 spec says the declaration is required for XML 1.1 but not for 1.0. Looking at the implementation, there is a reference to Extensible Markup Language (XML) 1.0 (Fifth Edition) which states “Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration…” That is, <?xml is unnecessary in XML if it is in UTF-8 or in UTF-16. In XML, UTF-8 is the default and UTF-16 requires a BOM, making the encoding declaration redundant. The code implemented rejects XML that doesn’t start with <?xml, which means it will be rejecting well-formed XML documents.

Using StreamOpener in DataSource makes me nervous because the interface seems to imply the stream can be closed and opened many times, like a file. Often streams from the network can only be read once start to finish. Looking at the implementation I see you’re careful to only open and read the underlying stream once. Perhaps modifying the DataSource interface or adding some documentation would make the behavior clearer.

Since Bryan brought up naming I’ll add my two cents though it is very minor and feel free to ignore me. I was wondering what O stood for as well and would not have guessed Orbit. With acronyms in class names I think camel case makes it easier to read as it provides breaks between words, e.g. KvnLexicalAnalyzer instead of KVNLexicalAnalyzer. Also makes it easier to search for the class in your IDE by typing KLA at the search bar. Again, very minor.

AttitudeEphemerisFileWriter interface also has only one implementation but it still exists. I think that for consistency between the attitude and orbit files, the AttitudeEphemerisFileParser interface should be put back. Furthermore, it will have the merit to exist if a second implementation is added.

It is a very interesting program.

The document is not public yet. In order to be approved as a CCSDS recommendation, two independent implementations are required, and this is what we are working on. I can therefore not publish the intermediate draft I have on the forum for the general public. I will send it to you in a private message as I have it in order to work on the implementation, and you belong to the development team and have valuable opinions about this implementation.

Yes, maintaining compatibility is important here, and version 2 will probably be used for many years. Version 3 is mostly compatible with version 2: for the existing message types (OPM, OEM and OMM) it includes only two changes:

  • the Mission Elapsed Time (MET) and Mission Relative Time (MRT) are not present anymore
    in the recommendation as they are relative to external data from an ICD. SCLK is still allowed
    because now the T0 for this scale is present in the OCM metadata. Orekit will still support MET
    and MRT by providing the mission reference date in the parser, as before
  • a new MESSAGE_ID key is allowed in all ODM header, including the existing ones, and is

You are right. I just added a general ParserBuilder in the ndm package that can build parsers for all message types, and recommended using this builder in the documentation.
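
As a rough illustration of why a builder helps in concurrent code, here is a sketch in which the builder is immutable and every thread obtains its own mutable parser. BuilderSketch, Parser, withMu, withInterpolationDegree and buildParser are hypothetical names chosen for this example; they are not claimed to match the ParserBuilder added in the ndm package.

```java
// Hypothetical sketch of an immutable builder producing single-use mutable parsers; NOT Orekit code.
class BuilderSketch {

    // Mutable parser: it accumulates state while parsing, so it must not be shared between threads
    static final class Parser {
        private final double mu;
        private final int interpolationDegree;
        private final StringBuilder parsed = new StringBuilder();

        Parser(double mu, int interpolationDegree) {
            this.mu = mu;
            this.interpolationDegree = interpolationDegree;
        }

        double getMu() { return mu; }
        int getInterpolationDegree() { return interpolationDegree; }

        String parseMessage(String content) {
            parsed.setLength(0);
            parsed.append(content.trim());
            return parsed.toString();
        }
    }

    // Immutable builder: safe to share, each withXxx call returns a new instance
    static final class ParserBuilder {
        private final double mu;
        private final int interpolationDegree;

        ParserBuilder() { this(Double.NaN, 1); }

        private ParserBuilder(double mu, int interpolationDegree) {
            this.mu = mu;
            this.interpolationDegree = interpolationDegree;
        }

        ParserBuilder withMu(double newMu) {
            return new ParserBuilder(newMu, interpolationDegree);
        }

        ParserBuilder withInterpolationDegree(int newDegree) {
            return new ParserBuilder(mu, newDegree);
        }

        // A fresh parser for each call, so each thread or message gets its own
        Parser buildParser() {
            return new Parser(mu, interpolationDegree);
        }
    }
}
```

A typical use would be to configure one builder at start-up and call buildParser() inside each worker task.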

I overlooked this. It is a big problem. One solution could be to have the selector choose XML or KVN when it is certain the format is right (i.e. when we really found <?xml in one case and CCSDS_###_VERSION in the other case) and to fall back to a default configured in the parser (and hence configured using a withDefaultFormat(FileFormat) in the new ParserBuilder). Do you think it
would be acceptable?

I hesitated a lot on this point, and exactly for the reason you point out. This is why I added an intermediate BufferedInputStream to avoid closing and reopening the stream. I don’t know how
to change the interface as if the stream comes from some live stream, allowing to reopen it would
imply saving the data ourselves which could be a waste of memory. Improving the documentation seems a better way to me. Anyway, the original use of StreamOpener was only to allow a lazy
opening, or even not opening some data sources at all (for example in DirectoryCrawler, if the
name of the data does not match what the dataLoader expects). We could say in the documentation that the stream is not expected to be opened twice, but either not opened at all or opened once.

OK, I have done that, including for the classes named after CCSDS messages (and there were quite a number of these). Fortunately, the IDE helps a lot with renaming classes and methods.

I’m fine with that. I have put the interface back (and it is really small).

Well, reading the proposed standard you sent clarified your original approach as acceptable. Specifically Section 8.2. Looks like CCSDS is adding an additional restriction to XML 1.0 for the ODM. But even the example in Annex H doesn’t follow the rule of always including the xml declaration.

The version 3 spec also seems to delegate the list of supported frames, time scales, etc. to a SANA website. That would make it hard to remain compatible with the spec as the website is modified. Annex B seems to imply that there aren’t any values that have to be supported. (“The message
creator should seek to confirm with the recipient(s) that their software can support the selected
keyword value”) Annex B also seems to imply that using values not in the SANA registry is also acceptable. (“Until a suggested value is included in the SANA registry, exchange partners may
define and use values that are not listed in the SANA registry if mutually agreed between
message exchange partners.”) Therefore IMHO the CCSDS spec does not achieve its goal of facilitating interoperability and data exchange because two fully conforming implementations of the spec do not need to understand the same frames, time scales, etc.

This creates some practical issues for Orekit. For example, the website also has some oddities such as not allowing GCRF for REF_FRAME but instead requiring GCRFn where n is a number. I don’t think that makes a practical difference for Orekit as the revisions to ICRF/GCRF seek to maintain stable axes, just reducing the noise of the position of defining sources. Not sure how we should handle the new website registry framework going forward. Thoughts?


I think the solution selected by the other reference implementation team is to download the current version of the lists using the CSV or JSON API. I would refrain from doing this automatically and for each message parsed (I don’t like applications that randomly connect to the web, and Orekit is sometimes used in locked-down networks). We could have these CSV/JSON files put in orekit-data, but I don’t know what we could do with them. We still need to have mappings between strings, CCSDS enumerates, and Orekit objects.
If a message contains a reference to an object that is not supported by Orekit, I don’t know what to do with it. Having both the String and the object version of some entry like we had in 10.x is not really a solution. We have to always check if the object is null or not, and in fact if it is we could just read the message and do nothing with it (typically if we do not understand what frame it uses).

Perhaps when parsing just save the String. Then only attempt to turn it into an Orekit object when needed, e.g. when the user calls getPropagator(). That way the user may still be able to examine the contents of some files using Orekit even if Orekit can’t build a propagator from it. Not sure if that is a use case worth supporting though.

When writing it probably makes sense to just allow any value for these fields because any restrictions implemented at compile time could easily be out of date by run time.

I have done this, introducing a FrameFacade in the issue-474 branch, that holds references to various frame types (Orekit frame, CCSDS celestial body frame, CCSDS orbit-relative frame, CCSDS spacecraft body frame, and simple name) as ADM allows any of these frames to be used for attitude. If a name matches none of the specialized types, only its name is retained. This seems to work as one of the OCM tests uses an EFG frame that Orekit does not support.
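
The "always keep the name, resolve to a known object only when possible" idea can be sketched as follows. FacadeSketch, KnownFrame and asKnownFrame are hypothetical, and a single enum stands in for the several specialized frame types; this is not the FrameFacade from the branch.

```java
import java.util.Optional;

// Hypothetical sketch of a facade keeping the raw name plus an optional resolved frame; NOT Orekit code.
class FacadeSketch {

    // Stand-in for the frames the library actually supports
    enum KnownFrame { EME2000, GCRF, ITRF }

    static final class FrameFacade {
        private final String name;
        private final KnownFrame frame;   // null when the name is not recognized

        private FrameFacade(String name, KnownFrame frame) {
            this.name = name;
            this.frame = frame;
        }

        static FrameFacade parse(String name) {
            KnownFrame resolved;
            try {
                resolved = KnownFrame.valueOf(name);
            } catch (IllegalArgumentException unknown) {
                resolved = null;   // keep only the name
            }
            return new FrameFacade(name, resolved);
        }

        String getName() { return name; }

        Optional<KnownFrame> asKnownFrame() { return Optional.ofNullable(frame); }
    }

    public static void main(String[] args) {
        System.out.println(FrameFacade.parse("GCRF").asKnownFrame().isPresent());
        System.out.println(FrameFacade.parse("EFG").getName());
    }
}
```

A file using an unsupported frame can then still be read and inspected, and writing can reuse getName() unchanged.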

Hi Luc,

I started using the new API a bit and I noticed an issue with the move to DataSource. It is no longer possible to specify an encoding or use a Reader.

In the previous release the user could provide a BufferedReader. That let the user take care of decoding the bytes into characters and figuring out the character encoding. The new implementation with DataSource can only provide an InputStream, and the parser classes (SP3, OEM, etc.) all assume a UTF-8 encoding. That makes it hard to use data that is in any other encoding, or data that is already characters (e.g. a StringReader). The change seems to have been made in 51190cd.

One option could be to take inspiration from the org.xml.sax.InputSource class which allows the user the flexibility of providing an InputStream with optional encoding or a Reader.
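As a rough illustration of what that flexibility looks like from the consumer side, the following Java sketch opens a character stream from an org.xml.sax.InputSource. The InputSource calls used (getCharacterStream, getByteStream, getEncoding, setEncoding) are the standard JDK ones; the helper class InputSourceReaders and the UTF-8 fallback are my own assumptions, not existing Orekit code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.xml.sax.InputSource;

/** Hypothetical helper showing how a parser could consume an InputSource-like object. */
final class InputSourceReaders {

    private InputSourceReaders() {
    }

    /** Prefer characters if the user supplied them, otherwise decode the bytes. */
    static BufferedReader open(InputSource source) throws IOException {
        Reader chars = source.getCharacterStream();
        if (chars != null) {
            // the user already decoded the data, no encoding question arises
            return new BufferedReader(chars);
        }
        InputStream bytes = source.getByteStream();
        if (bytes == null) {
            throw new IOException("source provides neither characters nor bytes");
        }
        // assumed default: fall back to UTF-8 when no encoding was specified
        Charset charset = source.getEncoding() == null
                          ? StandardCharsets.UTF_8
                          : Charset.forName(source.getEncoding());
        return new BufferedReader(new InputStreamReader(bytes, charset));
    }
}
```

A caller with a String in memory would pass new InputSource(new StringReader(text)), while a caller with ISO-8859-1 bytes would pass the stream and call setEncoding("ISO-8859-1") on the source.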

Another option could be providing a second parse(..) method in EphemerisFileParser that takes a Reader.

What do you think?



You are right.

I don’t really know how to fix this.

Adding a second parse method looks simpler at first sight, but I fear it will not work in all cases. First, we should also add it to the MessageParser interface that is implemented by all CCSDS parsers, but this is not a problem. The problem is that the parsers delegate to a LexicalAnalyzer and that LexicalAnalyzerSelector needs to see raw bytes before selecting either KvnLexicalAnalyzer, which will add the encoding or XmlLexicalAnalyzer which will let the SaxParser select the proper encoding. We could pass an InputSource rather than an InputStream to the SaxParser, but that is only after having selected the lexical analyzer.

We could add an encoding property to the parsers so they build the reader properly if they need to. It would simply be set by adding a withEncoding to ParserBuilder, with a default value set to UTF8. Unfortunately, that would not solve the case for input data that is already in characters (except if we attempt to convert characters to bytes so the parser can convert them back, which would be really ugly and inefficient).
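A minimal sketch of what such an encoding property could look like, in Java. The class EncodingAwareBuilder is hypothetical and only mimics the withXxx style; it is not the actual ParserBuilder, and the immutable "return a new builder" behaviour is an assumption on my part.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical builder; only the encoding property is shown. */
final class EncodingAwareBuilder {

    private final Charset encoding;

    /** Default encoding is UTF-8, as suggested above. */
    EncodingAwareBuilder() {
        this(StandardCharsets.UTF_8);
    }

    private EncodingAwareBuilder(Charset encoding) {
        this.encoding = encoding;
    }

    /** Returns a new builder (assumed immutable style); the original is left untouched. */
    EncodingAwareBuilder withEncoding(Charset newEncoding) {
        return new EncodingAwareBuilder(newEncoding);
    }

    Charset getEncoding() {
        return encoding;
    }

    /** What a parser built from this builder would do when it needs characters from raw bytes. */
    BufferedReader openReader(InputStream bytes) {
        return new BufferedReader(new InputStreamReader(bytes, encoding));
    }
}
```

As noted above, this covers byte sources in other encodings but does nothing for data that already arrives as characters.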

So perhaps having DataSource provide either an InputStream or a Reader would be more general. If we go this way, it would mean adding several constructors to build the DataSource (from file, from stream opener, from already open stream, from reader…) and then several stream types (bytes or characters). The DataSource implementation would know how to go from the raw source to the stream.
This approach also has its own drawbacks. We should check all current uses of DataSource constructors to avoid getting the InputStream and converting it to a Reader, and rather use the Reader provided by the DataSource directly (tedious work but feasible). Another, more difficult problem is still linked to LexicalAnalyzerSelector: it reads the first bytes from an InputStream wrapped in a BufferedInputStream so it can rewind them, and then builds a new DataSource that points to this already open (and rewound) stream when it creates either the XmlLexicalAnalyzer or the KvnLexicalAnalyzer.
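For reference, the peek-and-rewind technique mentioned here can be sketched in Java with BufferedInputStream.mark and reset. The BytePeek class and its looksLikeXml check are hypothetical simplifications (ASCII-compatible encodings only, no byte order mark handling); the actual selector may well do more than this.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: look at the first bytes of a stream without consuming them. */
final class BytePeek {

    private BytePeek() {
    }

    /** Read up to n leading bytes, then rewind so the next consumer sees the stream from the start. */
    static byte[] peek(BufferedInputStream in, int n) throws IOException {
        in.mark(n);                       // remember the current position
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buf, total, n - total);
            if (read < 0) {
                break;                    // stream shorter than n bytes
            }
            total += read;
        }
        in.reset();                       // rewind to the marked position
        return Arrays.copyOf(buf, total);
    }

    /** Simplified format check: XML if the stream starts with "<?xml", anything else is treated as KVN. */
    static boolean looksLikeXml(BufferedInputStream in) throws IOException {
        return "<?xml".equals(new String(peek(in, 5), StandardCharsets.US_ASCII));
    }
}
```

Because the stream is rewound, the same BufferedInputStream can be handed to whichever lexical analyzer is selected. The difficulty described above is that a Reader-based source needs an equivalent character-level path.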

So I don’t know what should be done here.

If the stream is already characters I think the job of the LexicalAnalyzerSelector becomes easier because it doesn’t have to worry about encoding. Overall it would be a bit more complex because it would have to separate cases for characters vs. bytes. In pseudo code:

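class DataSource {
  InputStream stream;     // set when the source provides raw bytes
  BufferedReader reader;  // set when the source provides characters
}

class LexicalAnalyzerSelector {

  LexicalAnalyzer select(DataSource source) throws IOException {
    if (source.reader != null) {
      // have characters: peek at the first five, then rewind
      char[] buf = new char[5];
      source.reader.mark(buf.length);
      int n = source.reader.read(buf, 0, buf.length);
      source.reader.reset();
      if (n == buf.length && "<?xml".equals(new String(buf))) {
        return new XmlLexicalAnalyzer(source);
      } else {
        return new KvnLexicalAnalyzer(source);
      }
    } else {
      // existing algorithm with bytes ...
    }
  }
}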

Clearly not production ready, maybe not even the exact types you would want to use. But I think it illustrates a path forward. And it could be used with the overloaded parse method or the expanded DataSource as illustrated here.

Do you think that would work?


Yes it should work.
I’ll let you handle this.