MS Office Forum / Word / General MS Word Questions / October 2005

Bug? searching for high unicode ranges

Bruce Rusk - 18 Oct 2005 02:25 GMT
I seem to have stumbled across a bug in how Word (I have Word 2003 on XP Pro
English) handles certain Unicode ranges in searches.

I have the extended Chinese font SimSun (Founder Extended) installed from
the Proofing Tools, and it contains a number of characters from CJK Unified
Ideographs Extension B (code points U+20000 and up). I can do a normal
find, either from the UI or by manipulating the Find object in VBA, for
characters in this range without a problem. But Word doesn't seem to be able
to handle wildcard searches (in the format [A-Z]) for a range of characters
from this part of Unicode. A little experimentation suggests that it can't
handle anything from U+10000 up.

To replicate the problem, open the Find dialog, check "Use Wildcards," and
enter the following:

"[10000"
then press Alt-X to insert the character, then a dash ("-"), then
"10001"
then Alt-X again, then "]"

Two boxes should replace the characters (unless you have a font installed
that shows Linear B). When I perform the search, I get the following error
message:

   "The Find What text contains a range that is not valid."

Word should be able to handle this, since:

- You can do the search in this way for any character range below
U+10000

- You can search individually for characters U+10000 and above (try
inserting one and searching for it)

Can anyone else replicate this problem? And if anyone is beta-testing Word
12, can you test if the same problem persists there (and if so, fire off a
bug report?).

Thanks,

Bruce Rusk
Tony Jollans - 18 Oct 2005 15:20 GMT
It wouldn't surprise me at all if you have found a limitation of Word. I
believe Word uses UTF-16 encoding, which can't represent characters at
U+10000 and above in a single 16-bit unit.

--
Enjoy,
Tony

Michael (michka) Kaplan [MS] - 18 Oct 2005 16:43 GMT
Correct -- you have to use the surrogate pairs to find the text.

Signature

MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

Bruce Rusk - 18 Oct 2005 17:33 GMT
But you can find such characters individually if you have one in the text
and then search for it. You can find it through the normal search dialog.
It's only when you search for it as a wildcard (either as a range, in the
original post, or just a single-character wildcard, [X]) that it fails.

How would a surrogate pair search work?

Michael (michka) Kaplan [MS] - 21 Oct 2005 10:43 GMT
Well, perhaps there is a bug there.... but the key would be to think of the
supplementary character as a surrogate pair so that you are searching for
the two UTF-16 code units. Ranges will not work unless they take this into
account.

I am not sure of the syntax you are using, but the wildcard failure may be
similar....
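
To make the surrogate-pair idea concrete, here is a minimal VBA sketch of the
standard UTF-16 arithmetic that turns a supplementary code point into its two
code units. The procedure names SurrogatesOf and DemoSurrogates are
illustrative only and do not come from Word's object model.

```vba
Option Explicit

' Split a supplementary code point (U+10000..U+10FFFF) into its
' UTF-16 high and low surrogate code units.
' (SurrogatesOf is a hypothetical helper name, not a Word API.)
Sub SurrogatesOf(ByVal codePoint As Long, ByRef hi As Long, ByRef lo As Long)
    Dim offset As Long
    offset = codePoint - &H10000          ' leaves a 20-bit value
    hi = &HD800& + (offset \ &H400&)      ' top 10 bits -> U+D800..U+DBFF
    lo = &HDC00& + (offset Mod &H400&)    ' bottom 10 bits -> U+DC00..U+DFFF
End Sub

Sub DemoSurrogates()
    Dim hi As Long, lo As Long
    SurrogatesOf &H20001, hi, lo
    Debug.Print Hex$(hi), Hex$(lo)        ' prints D840 and DC01
End Sub
```

So a character such as U+20001 sits in the document as the two units U+D840
and U+DC01, and any search has to be phrased in terms of those.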

Michael (michka) Kaplan [MS] - 24 Oct 2005 16:07 GMT
Actually, more info here:

http://blogs.msdn.com/michkap/archive/2005/10/24/483965.aspx

:-)

Bruce Rusk - 24 Oct 2005 17:18 GMT
Thanks, Michael, that's a great explanation.

So the question that raises for me is ....

Is there a way to use Word's find/replace interface (from the GUI or from
VBA) to handle these surrogate pairs?

The specific problem I'm dealing with is preparing documents for a
compositor. Characters they can't handle normally (esp. CJK) need to be
replaced with the string "XXX."

The only workaround I've found is to isolate *all* characters at U+10000 and
above indirectly: do a find/replace on the range U+0001 to U+FFEF that gives
every matching character an attribute such as Highlight or Hidden that isn't
otherwise used in the document, then search for all text without that
attribute and replace it, then remove the attribute from the remaining text.
This is risky (it assumes that Hidden or whatever other attribute isn't
already in use), and it would be nice eventually to be able to search for
these ranges directly.

Any suggestions out there?

Bruce

Michael (michka) Kaplan [MS] - 24 Oct 2005 17:27 GMT
You could actually search for the high surrogate range in question, and
then every time you find one:

1) replace that code unit
2) delete the next code unit

Or, instead of #2, you could do a second search on the low surrogate range
and replace each match with a zero-length string (ZLS).

Like I said, it is all about recognizing that you should be thinking about
the underlying storage more.
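
The two steps above can be sketched on a plain VBA String, which is stored as
UTF-16 code units. This is only an illustration of the logic, not a search
over a Word document: the function name ReplaceSupplementary is hypothetical,
and applying it to document text would discard formatting.

```vba
Option Explicit

' Replace every surrogate pair in a string with a placeholder.
' (ReplaceSupplementary is a hypothetical helper, not a Word API.)
Function ReplaceSupplementary(ByVal s As String, _
                              ByVal placeholder As String) As String
    Dim i As Long, cu As Long, nextCu As Long, result As String
    i = 1
    Do While i <= Len(s)
        ' AscW returns a signed Integer; mask it to get 0..65535.
        cu = AscW(Mid$(s, i, 1)) And &HFFFF&
        nextCu = 0
        If cu >= &HD800& And cu <= &HDBFF& And i < Len(s) Then
            nextCu = AscW(Mid$(s, i + 1, 1)) And &HFFFF&
        End If
        If nextCu >= &HDC00& And nextCu <= &HDFFF& Then
            result = result & placeholder    ' 1) replace the high surrogate
            i = i + 2                        ' 2) drop the low surrogate
        Else
            result = result & Mid$(s, i, 1)  ' ordinary code unit: keep it
            i = i + 1
        End If
    Loop
    ReplaceSupplementary = result
End Function
```

For example, ReplaceSupplementary(ChrW(&HD840) & ChrW(&HDC01) & "ABC", "XXX")
returns "XXXABC".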

Tony Jollans - 24 Oct 2005 19:37 GMT
The key here, I think, is recognising the difference between a 'character'
and a glyph. Code points above U+FFFF are two (UTF-16) characters each. The
Word GUI recognises them and displays them as a single glyph (the correct
one if you have an appropriate font installed, else a square). There are
parallels to this already in Word, like the end of (table) cell mark which
is chr(13) and chr(7). Outside the GUI, however, these code points are pairs
of characters and must be treated as such.

When you try and Find a single code point, Word searches for the two
(consecutive) characters which make it up, and that works as it would with
any other two consecutive characters. When you try and include the two
characters as bounds of a wildcard range, Find/Replace will try to look for
either the (first half of the first pair), or anything in the range
(second half of the first pair) to (first half of the second pair), or the
(second half of the second pair). However, due to the nature of surrogate
pairs, the (first half of the second pair) is always less than the (second
half of the first pair), and so Find/Replace declares the range invalid.
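
A small VBA sketch can make this visible by dumping the UTF-16 code units of
the find string from the original post. The names CodeUnitsHex and DemoRange
are illustrative, not part of Word.

```vba
Option Explicit

' Return the UTF-16 code units of s as space-separated 4-digit hex values.
' (CodeUnitsHex is a hypothetical helper, not a Word API.)
Function CodeUnitsHex(ByVal s As String) As String
    Dim i As Long, cu As Long, result As String
    For i = 1 To Len(s)
        ' AscW returns a signed Integer; mask it to get 0..65535.
        cu = AscW(Mid$(s, i, 1)) And &HFFFF&
        result = result & Right$("000" & Hex$(cu), 4) & " "
    Next i
    CodeUnitsHex = RTrim$(result)
End Function

Sub DemoRange()
    ' U+10000 and U+10001 written out as surrogate pairs.
    Dim startChar As String, endChar As String
    startChar = ChrW(&HD800) & ChrW(&HDC00)
    endChar = ChrW(&HD800) & ChrW(&HDC01)
    Debug.Print CodeUnitsHex("[" & startChar & "-" & endChar & "]")
    ' prints: 005B D800 DC00 002D D800 DC01 005D
End Sub
```

Read unit by unit, the bracket contains D800, then the range DC00-D800, then
DC01. Since DC00 is greater than D800, the range runs backwards, which matches
the "range that is not valid" message.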

Recognising that it works with single UTF-16 characters allows a workaround.
Any code point above U+FFFF is represented by two characters - the first in
the range U+D800 to U+DBFF, the second in the range U+DC00 to U+DFFF - so
searching for [U+D800-U+DBFF][U+DC00-U+DFFF] *should* return what you want.
Unfortunately you run into another problem - you can't enter U+DC00 to
U+DFFF in the Find dialog (Word thinks they're not valid). However, as
you're only looking for a single character after the high surrogate, you can
use ? instead, giving a find string of [U+D800-U+DBFF]? - with a replace
string of XXX, hitting Replace All seems to do the trick.

--
Enjoy,
Tony

Bruce Rusk - 24 Oct 2005 20:25 GMT
Thanks Tony, that ALMOST works.

Searching for "[U+D800-U+DBFF]?" finds all the supplementary characters, but
ALSO the next character. Thus a find/replace for "[U+D800-U+DBFF]?" with
"XXX" turns the character U+20001 followed by ABC into XXXBC (rather than
XXXABC, as one might expect).

Perhaps Word is misidentifying which characters I want to replace because it
"finds" the first character in the pair but treats the pair as a single
supplementary character for replacement purposes.

In any case I found a solution: while Word won't let you enter the second
character in the surrogate pair (i.e., U+DC00 to U+DFFF) from the Find
dialog, it WILL accept these characters in a string in VBA.

Thus

    [U+D800-U+DBFF][U+DC00-U+DFFF]

can be input via

    ' Build both surrogate ranges with ChrW, since the dialog rejects U+DC00-U+DFFF
    SomeRange.Find.Text = "[" & ChrW(&HD800) & "-" & ChrW(&HDBFF) & _
        "][" & ChrW(&HDC00) & "-" & ChrW(&HDFFF) & "]"

Seems to work fine.
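
For anyone who wants the whole thing in one place, here is a minimal sketch of
a complete macro built around that find string. It assumes the behaviour
reported in this thread for Word 2003 and has not been checked against other
versions. The macro name is made up for illustration, and it only touches the
main body of the active document (not headers, footnotes, or text boxes).

```vba
Option Explicit

' Replace every supplementary character (stored as a surrogate pair)
' in the main text of the active document with "XXX".
Sub ReplaceSupplementaryCharacters()
    Dim pattern As String
    ' High surrogate range followed by low surrogate range.
    pattern = "[" & ChrW(&HD800) & "-" & ChrW(&HDBFF) & "]" & _
              "[" & ChrW(&HDC00) & "-" & ChrW(&HDFFF) & "]"

    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = pattern
        .Replacement.Text = "XXX"
        .MatchWildcards = True      ' needed for the [x-y] range syntax
        .Forward = True
        .Wrap = wdFindStop
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

It is probably worth running this on a copy of the document first, since
Replace All cannot be narrowed down afterwards.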

Thanks Tony and Michael for all your help,

Bruce

Tony Jollans - 24 Oct 2005 20:54 GMT
You are correct. Perverse little beast, isn't it? Glad you got something
working.

The Find object persists, so if you run the VBA once (just to set the Text)
you will then find the Find string in the UI contains the characters you
couldn't type and running the Replace from the UI then appears to work.
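
A minimal sketch of that priming step, assuming (as the behaviour above
suggests) that Selection.Find is the Find object the dialog shares. The macro
name is made up for illustration.

```vba
Option Explicit

' Load the surrogate-pair pattern into the Find settings without
' running a search, so it can then be used from the Find dialog.
Sub PrimeFindDialog()
    With Selection.Find
        .ClearFormatting
        .Text = "[" & ChrW(&HD800) & "-" & ChrW(&HDBFF) & "]" & _
                "[" & ChrW(&HDC00) & "-" & ChrW(&HDFFF) & "]"
        .MatchWildcards = True      ' keeps "Use wildcards" ticked
    End With
End Sub
```

After running it once, opening the Replace dialog should show the pattern
already filled in, leaving only the replacement text to type.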

--
Enjoy,
Tony
