Re: [isabelle] Potential Spam: Isabelle/PIDE as IDE for Standard ML

On Tue, 18 Nov 2014, Michael Norrish wrote:

The lexeme "\<forall>" is not a valid string literal in SML (Poly, mlton and
Moscow ML, certainly agree with me), but it's fine in a file being interpreted
with SML_file.

That is not a "lexeme", but an "Isabelle symbol". This concept is below the lexical syntax, like \uXXXX in Java. It is documented in the "implementation" and "isar-ref" manual.

The deeper question here is how plain 7 bit ASCII -- the only universal text representation standard that actually works -- can be extended to do a bit more, without breaking too many things. Unicode with its many uncertainties does break a lot. The UTF-8 encoding could be called half-decent today, but the JVM uses the older UTF-16, which causes many headaches.

The SML_file mode also rejects "\\<" as a string literal, which is incorrect.

Formally that is a "malformed symbol". One could try harder to accept that. The question is if this is of practical relevance. Are there concrete SML programs where this happens?

We don't have millions of SML projects around. So it might be simpler to change these, e.g. by using "\\" ^ "<".

In Isabelle/PIDE, the support for "auxiliary files" is limited to something that conforms to Isabelle text represented with this special symbol notation. So far I have never seen practical problems, by mere luck. If there are concrete counter-examples, we can think about concrete counter-measures to avoid problems.

Similarly, the string literal "∀" (UTF-8 encoding) is interpreted as if it were "\<forall>".

That is a additional aspect. The PIDE default encoding is UTF-8-Isabelle, which identifies certain Unicode points with certain Isabelle symbols, as specified in the cumulative etc/symbols files of the Isabelle system installation.

Encoding for Unicode introduce inherent unreliably. In practice one normally commits to just one encoding. For Isabelle, the above does the job without too many people ever noticing the point.

For pure SML, one could try harder to conform to strict UTF-8, but again the question is practical relevance. Traditionally, SML never really supported Unicode anyway, so it cannot be relied on in existing programs.

It's perhaps arguable what SML really requires of "∀":  mlton and
Poly/ML both see it as an error; Moscow ML accepts it as a string of size 3.  If
mlton and Poly/ML were to accept it, I expect they would follow Moscow ML and

   "∀" = "\226\136\128"

to true.

Such things are generally dangerous in Unicode. Even with a well-defined encoding, so-called "combining sequences" introduce some uncertainty about the actual text content. Do you mean before or after normalization?

Worse, if I load a file with a genuine occurrence of the UTF-8 encoded "∀", then the IDE will silently write it out to disk as "\<forall>" when the file is saved. This even happens if the ∀ occurs bare in a comment. I find this disturbing.

It is quite well documented in the Isabelle/PIDE documentation and literature. You could disable these conversions by removing all etc/symbols files. For Isabelle that would be very impractical. For SML it could be done, if there were a real need for it.

Note that this is also a minor annoyance in Isabelle proper.  Users can't access
files called ∀.c by passing that name as written to an Isabelle/ML routine.
Instead they have to write  \226\136\128.c

Does that have any practical relevance, or is it just a synthetic example to break the system?

Unicode file-names are a sure way to ask for trouble. In conjunction with the "portable" JVM things are particularly fragile. I have seen routine problems with Far Eastern file names on Windows and sometimes Linux. There are also problems with continental European diacritics on Windows.

I don't think that Unicode will ever get there to deliver something that just works everywhere and for everyone. We can't just subscribe to a standard that is not a standard at all, but many of them with ongoing changes and no clear perspective.


          909,790 people so far

This archive was generated by a fusion of Pipermail (Mailman edition) and MHonArc.