Encoding issue

mdreizin · April 16, 2013

It seems the app has an issue related to incorrect encoding of the {query} / script block.

For instance, when I type the "й" symbol, the app sends it as 4 bytes, although it is actually 2 bytes in UTF-8.

Could you please explain:

How does the app encode the {query} / script block?
Which encodings does it use for that?

If you run "init.rb" via the terminal, you get:

ruby -Ku "init.rb" "й"

<?xml version="1.0"?><items><item uid="stats-0" valid="yes"><title>System</title><subtitle>hex: ["d0b9"], size: 2</subtitle><icon>icon.png</icon></item><item uid="stats-1" valid="yes"><title>Query</title><subtitle>hex: ["d0b9"], size: 2</subtitle><icon>icon.png</icon></item></items>

BUT if you run the code via the workflow, you get:

<?xml version="1.0"?><items><item uid="stats-0" valid="yes"><title>System</title><subtitle>hex: ["d0b9"], size: 2</subtitle><icon>icon.png</icon></item><item uid="stats-1" valid="yes"><title>Query</title><subtitle>hex: ["d0b8cc86"], size: 4</subtitle><icon>icon.png</icon></item></items>

You can find the workflow here.

OS X 10.8.3
Alfred 2.0.2 (178)
Ruby 1.8.7 / 1.9.x / 2.0.x

Edited April 16, 2013 by mdreizin

Andrew · April 16, 2013

Alfred uses NSTask to bridge across to the scripting language. Cocoa automatically normalises any passed-in arguments using decomposition, which splits characters into a base character plus combining marks, as you see: "й" (d0b9) arrives as "и" (d0b8) followed by a combining breve (cc86). You'll need to re-normalise into the form you need.

http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

I've done plenty of research on this. One beta build even had a way to set the normalisation type in the workflow, but NSTask always re-normalised, so it was removed.
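
For scripts running on Ruby 2.2 or later, re-normalising could be done in the script itself. The following is only a minimal sketch under that assumption: String#unicode_normalize does not exist in the Ruby versions listed in the original post (1.8.7 / 1.9.x / 2.0.x), and normalise_query and hex_and_size are hypothetical helper names, not part of Alfred or of init.rb.

```ruby
# Sketch only: assumes Ruby 2.2+ (String#unicode_normalize is missing in 1.8.7/1.9/2.0).
# normalise_query and hex_and_size are hypothetical helpers, not part of Alfred.

# Re-compose a decomposed (NFD) query into NFC.
def normalise_query(raw)
  # Arguments may arrive tagged with a non-UTF-8 encoding, so relabel the bytes first.
  raw.dup.force_encoding(Encoding::UTF_8).unicode_normalize(:nfc)
end

# Same "hex / size" view as the subtitles in the XML output above.
def hex_and_size(str)
  [str.unpack("H*").first, str.bytesize]
end

decomposed = "\u0438\u0306"              # "и" + combining breve, as sent by the workflow
composed   = normalise_query(decomposed) # "й" as a single code point

p hex_and_size(decomposed)  # => ["d0b8cc86", 4]
p hex_and_size(composed)    # => ["d0b9", 2]
```

In a real workflow script the input would presumably come from ARGV[0] rather than a literal.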

Cheers,

Andrew

mdreizin · April 16, 2013

Andrew, thanks a lot for the reply. I will try to re-normalize strings in my code.

Edited April 16, 2013 by mdreizin

Andrew · April 17, 2013

I've created a small command line tool which should hopefully help you re-normalise any strings:

https://dl.dropboxusercontent.com/u/6749767/Alfred/normalise.zip

If you include this in your workflow itself, you should be able to run it directly like this:

usage: ./normalise -form NFC й

You can add -verbose after NFC to see what is happening, or no arguments to see the options.

Let me know if that helps at all.

hubertcampan · April 21, 2013

Hi Andrew,

It did help me with this topic, http://www.alfredforum.com/topic/1907-bash-script-in-workflow-language-of-locals-accented-characters/?p=11799, which _mk_ pointed out.

Thanks.

Hubert

Andrew · April 21, 2013

Excellent - I tried to find your topic to post it there too, glad you found it!

mdreizin · May 5, 2013

Hi Andrew,

It helped me resolve my issue.

Thanks a lot

m0nah · September 12, 2013

Andrew,

It helped me resolve my issue too.

Thanks!

mklement0 · May 12, 2014

@Andrew's normalise utility works great, but I've since found that there is an alternative using the standard utility iconv with the (somewhat obscurely named) UTF8-MAC encoding scheme:

Note: The following examples use bash.

iconv expects its input via a filename or stdin.

Applied to the example above:

# Converts NFD form of 'й' to NFC form

iconv -f UTF8-MAC <<<'й'

Some background:

The following examples use the input string 'ü' to demonstrate converting:

in NFC form, $'\xc3\xbc', i.e., bytes 0xC3 0xBC, which is the UTF-8 encoding of Unicode code point U+00FC
in NFD form, $'u\xcc\x88', i.e., a u (the base character) followed by bytes 0xCC 0x88, which is the UTF-8 encoding of Unicode code point U+0308, the so-called combining diaeresis (¨).

Note that in Terminal the result will always appear as ü; pipe to hexdump -C, for instance, to see the byte values.

# NFC -> NFD
iconv -t UTF8-MAC <<<$'\xc3\xbc' # -> $'u\xcc\x88'

# NFD -> NFC
iconv -f UTF8-MAC <<<$'u\xcc\x88' # -> $'\xc3\xbc'

These conversions are safe to use in that if the input string is already in the target format, it is left as is.
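
The same round trip and the "already in the target form" property can be checked from Ruby as well. This is a rough sketch that assumes Ruby 2.2+ and uses standard NFC/NFD via String#unicode_normalize, which may not match iconv's UTF8-MAC in every edge case; hex_bytes is a hypothetical helper name.

```ruby
# Sketch only: assumes Ruby 2.2+; standard NFC/NFD, not iconv's UTF8-MAC variant.
# hex_bytes is a hypothetical helper for inspecting the raw bytes.
def hex_bytes(str)
  str.unpack("H*").first
end

nfc = "\u00FC"   # "ü" as one code point: bytes c3 bc
nfd = "u\u0308"  # "u" + combining diaeresis: bytes 75 cc 88

puts hex_bytes(nfc.unicode_normalize(:nfd))  # 75cc88  (NFC -> NFD)
puts hex_bytes(nfd.unicode_normalize(:nfc))  # c3bc    (NFD -> NFC)

# Input that is already in the target form comes back unchanged.
puts nfc.unicode_normalize(:nfc) == nfc      # true
puts nfd.unicode_normalize(:nfd) == nfd      # true
```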

Edited May 12, 2014 by mklement0

chadv · June 19, 2014

@mklement0 Thanks for this tip. iconv works great for most strings, but I found that it does not work for some emoji, pile of poo for instance. It gives the following error:

[ERROR: alfred.workflow.input.scriptfilter] Code 0: iconv: (stdin):1:4: cannot convert

It's a shame, because it's such an elegant solution otherwise. For posterity, I'll add that this is happening in OS X 10.9.3 (libiconv 1.11); hopefully a future version fixes this.

@Andrew's script seems to be handling all emoji correctly. I'm going to use that for now.

Edited June 19, 2014 by chadv

mklement0 · August 11, 2014

@chadv: Thanks for investigating and letting me know. Shame indeed, especially given that it hasn't been fixed in OS X 10.10 (the current public beta), which still ships with the same libiconv version (1.11).

Curiously, 3- and 6-byte UTF-8 emoji sequences, as well as those 7-byte sequences that start with an ASCII byte (followed by combining characters), do work properly, but the majority of emoji (4-byte sequences) do not.
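
To see which byte-length group a given emoji falls into, something like the following sketch might help. It assumes Ruby 2.2+; the sample strings are illustrative picks of mine, not the exact emoji tested above, and byte_profile is a hypothetical helper name. It also shows that standard NFC normalisation leaves a 4-byte emoji such as pile of poo untouched, so nothing needs converting for it in the first place.

```ruby
# Sketch only: assumes Ruby 2.2+. Sample emoji are illustrative, not from the thread.
# byte_profile is a hypothetical helper: [UTF-8 byte count, code point count].
def byte_profile(str)
  [str.bytesize, str.length]
end

samples = {
  "heart (U+2764)"               => "\u2764",             # 3 bytes, 1 code point
  "heart + variation selector"   => "\u2764\uFE0F",       # 6 bytes, 2 code points
  "keycap 1 (ASCII digit first)" => "1\uFE0F\u20E3",      # 7 bytes, 3 code points
  "pile of poo (U+1F4A9)"        => "\u{1F4A9}",          # 4 bytes, 1 code point
  "flag (two regional letters)"  => "\u{1F1E9}\u{1F1EA}"  # 8 bytes, 2 code points
}

samples.each do |name, str|
  bytes, code_points = byte_profile(str)
  puts format("%-30s bytes=%d code points=%d", name, bytes, code_points)
end

poo = samples["pile of poo (U+1F4A9)"]
puts poo.unicode_normalize(:nfc) == poo  # true: no decomposition exists for it
```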

On a side note, Terminal.app, while *rendering* emoji as expected, doesn't handle them properly in terms of cursor placement, printing the next character, and backspacing. 6-, 7-, 8-byte sequences seemingly involve combining characters, and are misinterpreted as comprising *2 or 3* characters, which has all sorts of unwanted side effects.

Edited August 23, 2014 by mklement0
