Encoding issue


It seems the app has an issue related to incorrect encoding of {query} / the script block.

For instance, when I type the "й" symbol, the app sends it as 4 bytes, but it is actually 2 bytes in UTF-8.

Could you please explain to me:

  • How does the app encode {query} / the script block?
  • Which encodings does it use for that?

If you run "init.rb" via the terminal, you get:
 

ruby -Ku "init.rb" "й"
<?xml version="1.0"?><items><item uid="stats-0" valid="yes"><title>System</title><subtitle>hex: ["d0b9"], size: 2</subtitle><icon>icon.png</icon></item><item uid="stats-1" valid="yes"><title>Query</title><subtitle>hex: ["d0b9"], size: 2</subtitle><icon>icon.png</icon></item></items>

BUT if you run the code via the workflow, you get:

<?xml version="1.0"?><items><item uid="stats-0" valid="yes"><title>System</title><subtitle>hex: ["d0b9"], size: 2</subtitle><icon>icon.png</icon></item><item uid="stats-1" valid="yes"><title>Query</title><subtitle>hex: ["d0b8cc86"], size: 4</subtitle><icon>icon.png</icon></item></items>

You can find the workflow here.
 

OS X 10.8.3
Alfred 2.0.2 (178)
Ruby 1.8.7 / 1.9.x / 2.0.x
Edited by mdreizin
Link to comment


Alfred uses NSTask to bridge across to the scripting language. Cocoa automatically normalises any passed-in arguments using decomposition, which splits composed characters into a base character plus combining marks, as you see: "й" (d0b9) arrives as "и" (d0b8) followed by a combining breve (cc86). You'll need to re-normalise into the form you need.

 

http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
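
The decomposition described above can be reproduced outside Alfred. This is a minimal sketch, assuming Ruby 2.2 or later (String#unicode_normalize does not exist in the 1.8.7 / 1.9.x / 2.0.x versions mentioned in the original post); the helper name hex_and_size is made up for illustration.

```ruby
# Sketch only: assumes Ruby 2.2+ (String#unicode_normalize).
# hex_and_size is a hypothetical helper, not part of init.rb.
def hex_and_size(str)
  [str.unpack("H*").first, str.bytesize]
end

composed   = "\u0439"                          # "й" as a single code point (NFC)
decomposed = composed.unicode_normalize(:nfd)  # "и" (U+0438) + combining breve (U+0306)

puts hex_and_size(composed).inspect    # ["d0b9", 2]
puts hex_and_size(decomposed).inspect  # ["d0b8cc86", 4]
```

The two results match the "Query" subtitles reported for the terminal run and the workflow run respectively.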

 

I've done plenty of research on this; one beta build even had a way to set the normalisation type in the workflow, but NSTask always re-normalised, so it was removed.

 

Cheers,

Andrew

Link to comment


Andrew, thanks a lot for the reply. I will try to de-normalize strings in my code.
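
One way to do this inside a Ruby script is to convert the incoming query back to composed form (NFC) before using it. This is a sketch under assumptions, not the code from the workflow: it needs Ruby 2.2 or later, the method name normalised_query is hypothetical, and the fallback string only exists so the snippet runs without arguments.

```ruby
# Sketch only: assumes Ruby 2.2+ (String#unicode_normalize).
# normalised_query is a hypothetical helper name.
def normalised_query(raw)
  # Treat the argument bytes as UTF-8, then compose them (NFD -> NFC).
  raw.dup.force_encoding(Encoding::UTF_8).unicode_normalize(:nfc)
end

# Fall back to a decomposed "й" when no argument is given.
query = normalised_query(ARGV[0] || "\u0438\u0306")
puts "hex: #{query.unpack('H*').inspect}, size: #{query.bytesize}"
```

Run without arguments, this should report the same 2-byte d0b9 that the terminal run shows.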

Edited by mdreizin
Link to comment


I've created a small command line tool which should hopefully help you re-normalise any strings:

 

https://dl.dropboxusercontent.com/u/6749767/Alfred/normalise.zip

 

If you include this in your workflow itself, you should be able to run it directly like this:

 

usage: ./normalise -form NFC й

 

You can add -verbose after NFC to see what is happening, or no arguments to see the options.

 

Let me know if that helps at all :)

Link to comment
  • 2 weeks later...

 
Hi Andrew,
 
It helped me resolve my issue.
 
Thanks a lot ;)
Link to comment
  • 4 months later...
  • 7 months later...

@Andrew's normalise utility works great, but I've since found that there is an alternative using the standard utility iconv with the (somewhat obscurely named) UTF8-MAC encoding scheme:

 

Note: The following examples use bash.

iconv expects its input via a filename or stdin.

 

Applied to the example above:

 

# Converts NFD form of 'й' to NFC form
iconv -f UTF8-MAC <<<'й'

 

Some background:

The following examples use the input string 'ü' to demonstrate converting:

  • in NFC form, $'\xc3\xbc' - i.e., bytes 0xC3 0xBC, which is the UTF-8 encoding of Unicode code point U+00FC
  • in NFD form, $'u\xcc\x88' - i.e., a u (the base character) followed by bytes 0xCC 0x88, which is the UTF-8 encoding of Unicode code point U+0308, the so-called combining diaeresis (¨).

Note that in Terminal the result will always appear as ü; pipe to hexdump -C, for instance, to see the byte values.

# NFC -> NFD
iconv -t UTF8-MAC <<<$'\xc3\xbc' # -> $'u\xcc\x88'

# NFD -> NFC
iconv -f UTF8-MAC <<<$'u\xcc\x88' # -> $'\xc3\xbc'
 

These conversions are safe to use in that if the input string is already in the target format, it is left as is. 
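
For comparison, the same round trip and the "already in the target form" case can be sketched in Ruby. This assumes Ruby 2.2 or later and uses the standard Unicode NFC/NFD forms rather than iconv's UTF8-MAC, so results may differ from iconv in edge cases; the helper name hex is made up for illustration.

```ruby
# Sketch only: assumes Ruby 2.2+ (String#unicode_normalize).
# Uses standard NFC/NFD, which is not necessarily identical to UTF8-MAC.
def hex(str)
  str.unpack("H*").first
end

nfc = "\u00FC"   # ü as one code point: bytes c3 bc
nfd = "u\u0308"  # u + combining diaeresis: bytes 75 cc 88

puts hex(nfc.unicode_normalize(:nfd))  # NFC -> NFD: 75cc88
puts hex(nfd.unicode_normalize(:nfc))  # NFD -> NFC: c3bc
puts hex(nfc.unicode_normalize(:nfc))  # already NFC, left as is: c3bc
puts hex(nfd.unicode_normalize(:nfd))  # already NFD, left as is: 75cc88
```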

Edited by mklement0
Link to comment
  • 1 month later...


@mklement0 Thanks for this tip. iconv works great for most strings, but I found that it does not work for some emoji, Pile of Poo for instance. It gives the following error:

[ERROR: alfred.workflow.input.scriptfilter] Code 0: iconv: (stdin):1:4: cannot convert

It's a shame, because it's otherwise such an elegant solution. For posterity, I'll add that this is happening in OS X 10.9.3 (libiconv 1.11); hopefully a future version fixes this.

 

@Andrew's script seems to be handling all emoji correctly. I'm going to use that for now.

Edited by chadv
Link to comment
  • 1 month later...

but I found that it does not work for some emoji.

 

@chadv: Thanks for investigating and letting me know. Shame indeed, especially given that it hasn't been fixed in OS X 10.10 (the current public beta), which still ships with the same libiconv version (1.11).

 

Curiously, 3- and 6-byte UTF-8 emoji sequences, as well as those 7-byte sequences that start with an ASCII byte (followed by combining characters), do work properly, but the majority of emoji (4-byte sequences) do not.

 

On a side note, Terminal.app, while *rendering* emoji as expected, doesn't handle them properly in terms of cursor placement, printing the next character, and backspacing. 6-, 7-, 8-byte sequences seemingly involve combining characters, and are misinterpreted as comprising *2 or 3* characters, which has all sorts of unwanted side effects.
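
The byte lengths mentioned above can be checked with a short Ruby sketch. The sample sequences are my own picks and only assumed to be representative of the categories described; the code needs Ruby 2.2 or later and uses standard NFC rather than iconv.

```ruby
# Sketch only: assumes Ruby 2.2+; the samples are illustrative picks.
samples = {
  "white smiling face"         => "\u263A",            # one 3-byte code point
  "smiling face + selector"    => "\u263A\uFE0F",      # 3 + 3 bytes
  "keycap 1"                   => "1\uFE0F\u20E3",     # ASCII byte + 3 + 3 bytes
  "pile of poo"                => "\u{1F4A9}",         # one 4-byte code point
  "regional indicators U + S"  => "\u{1F1FA}\u{1F1F8}" # 4 + 4 bytes
}

samples.each do |name, str|
  unchanged = str.unicode_normalize(:nfc) == str
  puts format("%-26s bytes=%d codepoints=%d unchanged_by_nfc=%s",
              name, str.bytesize, str.codepoints.size, unchanged)
end
```

The 6-, 7-, and 8-byte samples each consist of 2 or 3 code points, and NFC leaves every sample unchanged, including the 4-byte one.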

Edited by mklement0
Link to comment
