Automatic transformation of XML namespaces/Transformations/Automatic transformation

An automatic transformation happens when one is explicitly requested, or when a workflow is being executed and its next element refers to an :Auto object.

Below it is implicitly assumed that scripts we cannot execute (for example, because the required software is not installed) are excluded from enumeration.

Figuring out the next enriched script

Currently I recommend using the second algorithm, because it requires specifying fewer relations and so will succeed more often in real situations.

Common

I call a transformer universal if it is marked as :universal true and the "universal precedence" is known to be less than or equal to the precedence of this transformer.

Given a set X of scripts, the set of checked scripts is the intersection of X with the set of executed scripts if this intersection is not empty; otherwise it is X.
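The definition of checked scripts can be sketched as follows. This is only an illustration: the function name `checked_scripts` and the representation of scripts as hashable values (here, plain strings) are assumptions, not part of the specification.

```python
def checked_scripts(candidates, executed):
    """Return the checked scripts for the set `candidates`.

    Hypothetical helper: scripts are modelled as hashable values.
    """
    candidates = set(candidates)
    # Prefer scripts that were already executed earlier in the transformation.
    already_run = candidates & set(executed)
    # Fall back to the whole set when none of the candidates was executed.
    return already_run if already_run else candidates
```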

The set of available chains from a set X of namespaces is defined as follows:

the set of source namespaces of the next enriched script is a superset of the set of target namespaces of the previous enriched script;
the set of source namespaces of the first enriched script intersects the set X;
either the set of target namespaces of the last enriched script is a subset of the set of target namespaces of the transformation or this script is universal;
(in addition to the above rules) the sequence is not a subsequence of another available chain. (Consequently, transformers having the same source and target should be skipped. Remark: Such transformers may nevertheless be useful as user-specified transformers.)
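A minimal sketch of how the first three rules could be checked for one candidate sequence. The `Script` record and the function name are hypothetical, `universal` is simplified to a plain boolean (the precedence condition is not modelled), and the last rule (not being a subsequence of another available chain) is deliberately left out because it needs the whole set of candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Script:
    """Hypothetical model of an enriched script."""
    name: str
    sources: frozenset
    targets: frozenset
    universal: bool = False  # simplification: precedence is not modelled


def satisfies_chain_rules(chain, start_namespaces, final_targets):
    """Check the first three rules for a sequence of scripts."""
    if not chain:
        return False
    # Each script's sources must be a superset of the previous script's targets.
    for previous, following in zip(chain, chain[1:]):
        if not following.sources >= previous.targets:
            return False
    # The first script's sources must intersect the starting set X.
    if not (chain[0].sources & frozenset(start_namespaces)):
        return False
    # The last script must hit the transformation targets or be universal.
    last = chain[-1]
    return last.universal or last.targets <= frozenset(final_targets)
```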

The set of first scripts for X is the set of checked scripts for the set of first scripts of the available chains for X.

The next outer script is determined as follows:

Take the first element for which all ancestors are in the target namespaces and for whose namespace (or for one of whose attribute namespaces) there is a known outwardly processed script.
- Among the namespaces of the matching element and of its attributes, select:
  - the element namespace (if it matches);
  - if the element does not match, the first matching attribute namespace in lexical order of namespaces.
- Consider the scripts for which this namespace is in the set of source namespaces.
- If there is no such script, there is no next outer script.
- Restrict to checked scripts.
- If there are several, choose those with the highest minimal preservance and, among them, the one of the highest priority. (FIXME: The above does not even guarantee that such a chain of scripts exists!)
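The namespace selection step of the procedure above might look like the sketch below. The function name and arguments are hypothetical, and "lexical order" is assumed here to mean Python's default string ordering (by code point), which the text does not specify.

```python
def select_namespace(element_ns, attribute_namespaces, outward_namespaces):
    """Pick the namespace used to look up the next outer script.

    `outward_namespaces` is the (assumed) set of namespaces for which an
    outwardly processed script is known.
    """
    # The element namespace wins whenever it matches.
    if element_ns in outward_namespaces:
        return element_ns
    # Otherwise take the first matching attribute namespace in sorted order.
    matching = sorted(ns for ns in attribute_namespaces if ns in outward_namespaces)
    return matching[0] if matching else None
```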

First algorithm (by precedence)

Rationale: The idea is to select the enriched script with the highest precedence. If there are several enriched scripts with the same precedence and this precedence is a singleton, then select an order based on grouping.

Investigate: If different namespaces are not children of one another, can we then apply both transformations without checking whether one's precedence is above the other's?

Consider the set of first scripts for the set of namespaces of the current XML document.
- Optionally (TODO: add to user options), if there is more than one first step among such executed scripts, give a warning or an error.
Among the first transformers in these scripts, choose the enriched script with the highest precedence.
If the number of highest precedences is not equal to one, then there is no next enriched script.
If there is exactly one script with this precedence, return it.
If there are several such enriched scripts and their precedence is a singleton class, choose the enriched script for which there exists an available chain with the highest minimal preservance and, among them, of the highest priority (see below). Note that only the first element of the chain is actually used; the rest serve only for calculating priorities. If the first elements of the chains are not known members of some known singular precedence, then terminate the transformation, or at least give a warning and choose the first script of the highest-priority chain.
Add the chosen enriched script to the digraph of used enriched scripts.

A sequence of enriched scripts is built in order to elaborate a transformation in the best way. The sequence is characterized by a priority, which should be minimized. The priority may be calculated taking into account the preservance ($p$), stability ($s$), and preference ($f$) of scripts. Possible formulas for priority, which may (but are not required to) be used to choose a path:

$\sum _{i}\left({\frac {1}{p_{i}}}+{\frac {1}{s_{i}}}+{\frac {1}{f_{i}}}\right)$
$\sum _{i}{\frac {1}{p_{i}+s_{i}+f_{i}}}$

TODO: Require $p,s,f\neq 0$ .
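For illustration, the two candidate formulas can be computed as below. The function names and the representation of a chain as a list of `(p, s, f)` tuples are assumptions; the check is also stricter than the TODO, requiring positive values so that neither formula can divide by zero.

```python
def _validated(chain):
    """Yield (p, s, f) triples, rejecting non-positive values (an assumption)."""
    for p, s, f in chain:
        if p <= 0 or s <= 0 or f <= 0:
            raise ValueError("preservance, stability and preference must be positive")
        yield p, s, f


def priority_sum_of_reciprocals(chain):
    """First formula: sum over scripts of 1/p + 1/s + 1/f."""
    return sum(1 / p + 1 / s + 1 / f for p, s, f in _validated(chain))


def priority_reciprocal_of_sums(chain):
    """Second formula: sum over scripts of 1/(p + s + f)."""
    return sum(1 / (p + s + f) for p, s, f in _validated(chain))
```

Under either formula a smaller value means a better chain, in line with the statement that priority should be minimized.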

Rationale: The preservance of a path is taken as the minimum along the path, so that tag-stripping transformations (such as HTML -> plain text) are reliably ruled out even when they compete with paths of many steps with higher preservances.

Remark: If an enriched script in the chain has the same source and target as an earlier used transformation, then use the earlier used enriched script. (Rationale: Keep consistency.)

The chosen sequence of enriched scripts must have no cycles.

If the processor is unable to determine the next enriched script in the precedence order, it should either give an error, or give a warning and choose the order arbitrarily.

Second algorithm (document order)

Rationale: This algorithm was created to avoid calculating maxima (which may not exist) in the partially ordered set of precedences of transformations. Should we take the next step in this direction and make processing defined by the document order, trying not to depend on which scripts are already loaded?

Rationale: Should it process enclosed tags first or outward tags first? Structured documents need to be processed inward first. For such things as a <comment> tag (whose content is ignored), the outward <comment> tag should be processed first, not its content (however, this is largely a performance issue only). As a compromise, I propose to mark some tags (such as <comment>) as outwardly processed, to find the first outwardly processed tags, and, if there are none, to start with the most enclosed tags.

A script is inward if it is marked as inward true in an asset RDF file.

Rationale: The assets for namespaces should be loaded in document order (not reverse document order), because outwardly processed elements take precedence over inwardly processed ones (and thus loading inner assets without loading outward ones makes no sense for determining the order of scripts).

The next script is determined as follows:

If there is a next outer script, return it.
If there is none, take the set of first scripts for the set {namespaces of elements for whose namespace there is a known inwardly processed script and for which all descendants are in the target namespace}. (TODO: Equally well we could instead choose last such element.) Rationale: We don't want to choose such an element by its namespaced attributes, because namespaced attributes should work for arbitrary content inside.
- Restrict them to checked scripts.
- Choose among them those with the highest minimal preservance and, among them, those of the highest priority.
- If there are several, choose one of them and give a warning.
Otherwise, there is no next script.
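The final selection steps (highest minimal preservance, then priority, then a warning on ties) could be sketched as follows. Everything here is an assumption made for illustration: chains are lists of `(name, preservance)` pairs, `priority` is any caller-supplied function, and "highest priority" is read as the smallest priority value, since priority is described elsewhere as something to minimize.

```python
def min_preservance(chain):
    """Preservance of a chain is the minimum over its scripts."""
    return min(preservance for _name, preservance in chain)


def choose_chain(chains, priority):
    """Return (chosen_chain, ambiguous); ambiguous=True means 'give a warning'."""
    if not chains:
        return None, False
    # Step 1: keep only chains with the highest minimal preservance.
    best = max(min_preservance(chain) for chain in chains)
    shortlisted = [chain for chain in chains if min_preservance(chain) == best]
    # Step 2: among them keep those with the best (smallest) priority value.
    best_priority = min(priority(chain) for chain in shortlisted)
    winners = [chain for chain in shortlisted if priority(chain) == best_priority]
    # Step 3: several winners left means an arbitrary choice plus a warning.
    return winners[0], len(winners) > 1
```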

Third algorithm (document order 2)

It misbehaves when the first element (as below) is of the same namespace as one of the transformation target namespaces.

~~The next script is determined as follows:~~

~~If there is next outer script, return it.~~
If there is none, take the set of first scripts for the set {namespace of first element in document order for whose namespace there is a known inwardly processed script and for which all descendants are in the target namespace}. (TODO: Equally well we could instead choose last such element.) Rationale: We don't want to choose such an element by its namespaced attributes, because namespaced attributes should work for arbitrary content inside.
- ~~Restrict them to checked scripts.~~
- ~~Choose among them with highest minimal preservance and among them of the highest priority.~~
- ~~If there are several choose one of them and give a warning.~~
~~Otherwise, there is no next script.~~

Automatic transformation process

Automatic transformation consists of applying every script in turn to the source document, as described by the algorithm below.

After the pipeline is finished, at user option, fail if there are namespaces not in the destination list. At user option, erase all tags (and their descendants) and attributes not in the destination list.

After every transformation step and before starting the transformation, XML well-formedness should be checked. Also XML validity should be checked.
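A well-formedness check of this kind can be sketched with Python's standard library; the function name is hypothetical. Note that `xml.etree.ElementTree` only checks well-formedness, so validity checking would need a separate validator, which is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser rejects anything that is not well-formed XML.
        return False
    return True
```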

The algorithm of automatic transformation

Phases

Rationale: A phase is a series of transformations which ends either when the target namespace of the last transformation is the destination namespace of the transformation (a complete end of the transformation) or when it has no particular target namespace (as in XInclude), so that we cannot analyze how the transformation continues without running actual XML transformations.

A phase is an algorithm which determines an enriched script by the following loop:

Build the list of XML namespaces based on the actual current XML document.
Figure out the next enriched script (as described in the "Figuring out the next enriched script" section above).
If there is no next enriched script, download the next RDF file.
Exit the loop if there is neither an available enriched script nor a next RDF file.
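One possible reading of the phase loop is sketched below. All four parameters are hypothetical callbacks standing in for parts described elsewhere; in particular, `load_next_rdf` is assumed to return False when no further RDF file is available.

```python
def run_phase(document, collect_namespaces, find_next_script, load_next_rdf):
    """Return the next enriched script, or None if the phase finds nothing."""
    while True:
        # Build the list of namespaces from the current document.
        namespaces = collect_namespaces(document)
        # Try to figure out the next enriched script with what is loaded so far.
        script = find_next_script(namespaces)
        if script is not None:
            return script
        # No script yet: load more asset information, or give up if none is left.
        if not load_next_rdf():
            return None
```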

The main loop

The main loop of an automatic transformation consists of repeatedly:

If all namespaces in the document are in the destination namespaces list, then exit from the loop.
Calculate the next phase.
If the next phase didn't return an enriched script, exit from the loop.
~~Apply all enriched scripts in the phase.~~ Apply the enriched script returned by the phase, and add this enriched script to the set of executed scripts.
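The main loop might be sketched like this. The callbacks `collect_namespaces`, `next_phase` and `apply_script` are hypothetical placeholders, and the executed scripts are kept in a list (rather than a set) only so that the order is visible in the example.

```python
def transform(document, destination_namespaces, collect_namespaces,
              next_phase, apply_script):
    """Run the main loop; return the final document and the executed scripts."""
    destination = set(destination_namespaces)
    executed = []
    while True:
        # Stop when every namespace in the document is a destination namespace.
        if collect_namespaces(document) <= destination:
            break
        # A phase yields one enriched script, or None if nothing applies.
        script = next_phase(document)
        if script is None:
            break
        document = apply_script(script, document)
        executed.append(script)
    return document, executed
```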

Rationale: This algorithm makes it possible not to download the destination namespace RDF at all, if the user requires destination namespaces to be loaded last and the transformation happens to succeed without loading the destination namespace. It can also do the reverse (that is, load only the destination namespace and not load the source namespaces).

Rationale: Requiring all namespaces in the current document (not only that of the root element) makes it possible to apply precedences (which is important, for example, for correct XInclude processing).
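Collecting every namespace used in a document, rather than only the root element's, can be sketched with the standard library. The function name is an assumption; the code relies on `xml.etree.ElementTree` representing qualified names as `{namespace-uri}local-name`.

```python
import xml.etree.ElementTree as ET


def collect_namespaces(xml_text):
    """Return the set of namespace URIs of all elements and attributes."""
    namespaces = set()
    for element in ET.fromstring(xml_text).iter():
        # Look at the element name and at every attribute name.
        for qualified_name in (element.tag, *element.attrib):
            if isinstance(qualified_name, str) and qualified_name.startswith("{"):
                namespaces.add(qualified_name[1:].split("}", 1)[0])
    return namespaces
```

Unprefixed attributes are in no namespace, so they contribute nothing to the result.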

Validation after transformation

The user may request the software to combine transformation and validation after the transformation ends.

After transformation, validation proceeds as usual, but the algorithm state (such as the set of loaded assets, etc.) is "inherited" from the transformation pass.

Remark: Another option considered was to validate after every transformation step, but it is unclear how to validate a half-processed document like the following:

<html xmlns="http://www.w3.org/1999/xhtml">
  <include xmlns="http://www.w3.org/2001/XInclude" href="subdocument.xml"/>
</html>
