Bug? searching for high unicode ranges
Bruce Rusk - 18 Oct 2005 02:25 GMT
I seem to have stumbled across a bug in how Word (I have 2003 on XP Pro English) handles certain Unicode ranges in searches.
I have the extended Chinese font SimSun (Founder Extended) installed from the Proofing Tools, and it contains a number of characters from CJK Unified Ideographs Extension B (code points U+20000 and up). I can do a normal find, either from the UI or with VBA manipulation of the Find object, for characters in this range without a problem. But Word doesn't seem to be able to handle wildcard searches (in the format [A-Z]) for a range of characters from this part of Unicode. A little experimentation suggests that it can't handle anything from U+10000 up.
To replicate the problem, open the Find dialog, check "Use Wildcards," and enter the following:
"[10000" then click Alt-X to insert the character, then a dash ("-"), then "10001" then Alt-X again, then "]"
Two boxes should replace the characters (unless you have a font installed that shows Linear B). When I perform the search, I get the following error message:
"The Find What text contains a range that is not valid."
Word should be able to handle this, since:
- You can do the search in this way for any character range below U+10000
- You can search individually for characters U+10000 and above (try inserting one and searching for it)
Can anyone else replicate this problem? And if anyone is beta-testing Word 12, can you test if the same problem persists there (and if so, fire off a bug report?).
Thanks,
Bruce Rusk
Tony Jollans - 18 Oct 2005 15:20 GMT
It wouldn't surprise me at all if you have found a limitation of Word. I believe Word uses UTF-16 encoding, which can't represent characters from U+10000 up in a single 16-bit unit.
-- Enjoy, Tony
Michael (michka) Kaplan [MS] - 18 Oct 2005 16:43 GMT
Correct -- you have to use the surrogate pairs to find the text.
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap
This posting is provided "AS IS" with no warranties, and confers no rights.
Bruce Rusk - 18 Oct 2005 17:33 GMT
But you can find such characters individually if you have one in the text and then search for it. You can find it through the normal search dialog. It's only when you search for it as a wildcard (either as a range, as in the original post, or just as a single-character wildcard, [X]) that it fails.
How would a surrogate pair search work?
Michael (michka) Kaplan [MS] - 21 Oct 2005 10:43 GMT
Well, perhaps there is a bug there.... but the key would be to think of the supplementary character as a surrogate pair so that you are searching for the two UTF-16 code units. Ranges will not work unless they take this into account.
I am not sure of the syntax you are using, but the wildcard failure may be similar....
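To make the "surrogate pair" idea concrete, here is a minimal sketch of the standard UTF-16 arithmetic that splits a supplementary code point into its two 16-bit units. It is written in Python purely as an illustration outside Word; the function name `to_surrogate_pair` is made up for this example and is not part of Word or VBA.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into its
    UTF-16 high and low surrogates. Hypothetical helper for illustration."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


high, low = to_surrogate_pair(0x20001)
print("U+20001 -> U+%04X U+%04X" % (high, low))
```

So a search for a single character such as U+20001 is, at the storage level, a search for two consecutive 16-bit values.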
Michael (michka) Kaplan [MS] - 24 Oct 2005 16:07 GMT
Actually, more info here:
http://blogs.msdn.com/michkap/archive/2005/10/24/483965.aspx
:-)
Bruce Rusk - 24 Oct 2005 17:18 GMT
Thanks, Michael, that's a great explanation.
So the question that raises for me is ....
Is there a way to use Word's find/replace interface (from the GUI or from VBA) to handle these surrogate pairs?
The specific problem I'm dealing with is preparing documents for a compositor. Characters they can't handle normally (esp. CJK) need to be replaced with the string "XXX."
The only workaround I've found is indirect. Do a find/replace on the range U+0001 to U+FFEF and give every matched character a characteristic such as Highlight or Hidden that isn't otherwise used in the document. Then search for all non-highlighted or non-hidden text (what is left is the characters at U+10000 and above) and replace it, and finally turn the remaining text back to normal. This is risky (it assumes that Hidden or whatever other feature isn't already in use), and it would be nice eventually to be able to search for these ranges directly.
Any suggestions out there?
Bruce
Michael (michka) Kaplan [MS] - 24 Oct 2005 17:27 GMT
You could actually search for the high surrogate ranges in question, and then every time you find one:
1) replace the code point
2) delete the next code point
Or, instead of #2, you could do a second search on the low surrogate range and replace each match with a zero-length string (ZLS).
Like I said, it is all about recognizing that you should be thinking about the underlying storage more.
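The two-step procedure above can be sketched on raw UTF-16 code units. This is only an illustration in Python of the idea, not Word's Find object; `utf16_units` and `replace_supplementary` are hypothetical names introduced for this sketch.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def replace_supplementary(text, replacement="XXX"):
    """Replace every high+low surrogate pair with the replacement string."""
    units = utf16_units(text)
    out = []
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            out.append(replacement)   # step 1: replace the high surrogate
            i += 2                    # step 2: drop the low surrogate after it
        else:
            out.append(chr(units[i]))
            i += 1
    return "".join(out)


print(replace_supplementary("\U00020001ABC"))
```

Working on code units rather than displayed characters is what keeps the text after the pair intact.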
Tony Jollans - 24 Oct 2005 19:37 GMT
The key here, I think, is recognising the difference between a 'character' and a glyph. Code points above U+FFFF are two (UTF-16) characters each. The Word GUI recognises them and displays them as a single glyph (the correct one if you have an appropriate font installed, else a square). There are parallels to this already in Word, like the end of (table) cell mark which is chr(13) and chr(7). Outside the GUI, however, these code points are pairs of characters and must be treated as such.
When you try to Find a single code point, Word searches for the two (consecutive) characters which make it up, and that works as it would with any other two consecutive characters. When you try to include the two characters as bounds of a wildcard range, Find/Replace will look for either the (first half of the first pair), or anything in the range (second half of the first pair) to (first half of the second pair), or the (second half of the second pair). However, due to the nature of surrogate pairs, the (first half of the second pair) is always less than the (second half of the first pair), and so Find/Replace declares the range invalid.
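The ordering argument can be checked with a few lines of Python. This is a sketch of the reasoning only, under the assumption that the bracket expression is read one 16-bit unit at a time; it does not reproduce Word's actual parser, and the helper name `surrogates` is made up here.

```python
def surrogates(ch):
    """Return (high, low) UTF-16 surrogates for one supplementary character."""
    data = ch.encode("utf-16-be")
    return int.from_bytes(data[:2], "big"), int.from_bytes(data[2:4], "big")


first = surrogates("\U00010000")    # lower bound typed in the Find box
second = surrogates("\U00010001")   # upper bound typed in the Find box

# Read unit by unit, the bracket holds: high1, low1, '-', high2, low2,
# so the apparent range runs from low1 to high2.
range_start = first[1]
range_end = second[0]

print("range U+%04X-U+%04X, reversed: %s"
      % (range_start, range_end, range_start > range_end))
```

Because every low surrogate (U+DC00 to U+DFFF) sorts above every high surrogate (U+D800 to U+DBFF), the same reversal happens for any pair of supplementary bounds.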
Recognising that it works with single UTF-16 characters allows a workaround. Any code point above U+FFFF is represented by two characters, the first in the range U+D800 to U+DBFF and the second in the range U+DC00 to U+DFFF, so a search like [U+D800-U+DBFF][U+DC00-U+DFFF] *should* return what you want. Unfortunately you run into another problem: you can't enter U+DC00 to U+DFFF in the Find dialog (Word thinks they're not valid). However, as you're only looking for a single following character, you can use ? instead, giving a find string of [U+D800-U+DBFF]? and a replace string of XXX. Hitting Replace All then seems to do the trick.
-- Enjoy, Tony
Bruce Rusk - 24 Oct 2005 20:25 GMT
Thanks Tony, that ALMOST works.
Searching for "[U+D800-U+DBFF]?" finds all the supplementary characters, but ALSO the next character. Thus a find/replace for "[U+D800-U+DBFF]?" with "XXX" replaces the string U+20001 followed by ABC with XXXBC (rather than XXXABC, as one might expect).
Perhaps Word is misidentifying which characters I want to replace because it "finds" the first character in the pair but treats the whole pair as a single supplementary-plane character for replacement purposes.
In any case I found a solution: while Word won't let you enter the second character in the surrogate pair (i.e., U+DC00 to U+DFFF) from the Find dialog, it WILL accept these characters in a string in VBA.
Thus
[U+D800-U+DBFF][U+DC00-U+DFFF]
can be input via
SomeRange.Find.Text = "[" & ChrW(&HD800) & "-" & ChrW(&HDBFF) & _
    "][" & ChrW(&HDC00) & "-" & ChrW(&HDFFF) & "]"
Seems to work fine.
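For comparison outside Word, the two patterns can be tried against a plain code-unit model of the text. This Python sketch uses the standard `re` module on a string built from individual UTF-16 units; `as_code_units` is a hypothetical helper, and nothing here claims to mirror Word's internals.

```python
import re


def as_code_units(text):
    """Build a string whose items are the individual UTF-16 code units of text."""
    data = text.encode("utf-16-le")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little"))
        for i in range(0, len(data), 2)
    )


units = as_code_units("\U00020001ABC")   # D840 DC01 'A' 'B' 'C'

# Explicit pair: high surrogate range followed by low surrogate range.
explicit = re.sub("[\ud800-\udbff][\udc00-\udfff]", "XXX", units)

# High surrogate range followed by any single unit (the "?" idea).
wildcard = re.sub("[\ud800-\udbff].", "XXX", units)

print(explicit, wildcard)
```

In this model both patterns leave ABC untouched, which suggests the extra character lost with "?" comes from how Word maps the match back onto the text rather than from the pattern itself.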
Thanks Tony and Michael for all your help,
Bruce
Tony Jollans - 24 Oct 2005 20:54 GMT
You are correct. Perverse little beast, isn't it? Glad you got something working.
The Find object persists, so if you run the VBA once (just to set the Text) you will then find the Find string in the UI contains the characters you couldn't type and running the Replace from the UI then appears to work.
-- Enjoy, Tony