Re: [isabelle] Unicode (Was: Update on I3P)



On Fri, 10 Sep 2010, Joachim Breitner wrote:

I should add that this sequence of four characters is the utf8
representation of 𝒵 (\<Z>), re-encoded as utf8:
$ echo -n ð??µ|iconv -futf8 -tlatin1
𝒵
$ echo -n 𝒵 | iconv -flatin1 -tutf8
ð??µ

Maybe i3p has problems with unicode character code points that do not
fit in one 16-bit number?

This is a deeper problem of the way Unicode works on the JVM, see also http://download.oracle.com/javase/6/docs/api/java/lang/Character.html

Few Java applications get this specification of "surrogate code points" right, because this situation occurs rarely. UTF-8 works generally better because the multi-character sequences are the rule, not the exception.

It is unlikely that this will get fundamentally better on the JVM, although some people have already started joking that Oracle will soon identify "" and null for type String ...

The easyiest fix is to remove these non-16bit codes from the table of Isabelle symbol interpretation.


	Makarius


This archive was generated by a fusion of Pipermail (Mailman edition) and MHonArc.