Encoding in script filter input

nicke5012 · September 4, 2015

I'm trying to make my workflow robust against special characters, and I'm having a bit of trouble. When I run my Python script from the terminal with an argument containing special characters, it works fine. E.g.: python alfredwl.py show "a´´´´ÅÅÅ´´ÅÎÎÎÏÍÒˆ„ÏıÔÍÓÔÍÏıÅÍ¯ÅÍåå∂ß˚˜≤√åß"

However, when I run it through Alfred as "wlshow a´´´´ÅÅÅ´´ÅÎÎÎÏÍÒˆ„ÏıÔÍÓÔÍÏıÅÍ¯ÅÍåå∂ß˚˜≤√åß" in a Script Filter, my workflow seems to stumble on the character encoding. In my Python script I convert arguments from UTF-8 into Unicode strings, which I figured would work for Alfred. What encoding does Alfred's Script Filter use for the string it passes to the Bash script it runs?

Thanks,

Nick

vitor · September 4, 2015

Script Filters output XML, so that’s what you have to keep in mind. What you’re looking for is CDATA. You use it like so: <![CDATA[Whatever the heck you want, here]]>. So you can have a line in your Script Filter be something like <subtitle><![CDATA[a´´´´ÅÅÅ´´ÅÎÎÎÏÍÒˆ„ÏıÔÍÓÔÍÏıÅÍ¯ÅÍåå∂ß˚˜≤√åß]]></subtitle> and it’ll work.

Edited September 4, 2015 by Vítor

deanishe · September 5, 2015

CDATA probably won't help here: it's a problem specific to Python's (dumbass) encoding/decoding behaviour.

Alfred uses exclusively UTF-8 input and output.

String encoding in Python 2 is a bit of a bitch, unfortunately.

From a brief perusal of the source code on GitHub, your script actually does not convert any input from UTF-8 to Unicode.

This isn't necessarily a problem with Python 2, as it can work just fine with encoded strings, but you're using the json module, which returns Unicode, and then passing (potentially) bytestrings to ElementTree alongside it. If you pass a bytestring where a Unicode string is expected, Python 2 will typically try to decode it implicitly with the ASCII codec first, which is why your workflow works just fine with ASCII input, but blows up when given non-ASCII input.
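For anyone reading this on Python 3: the implicit coercion no longer exists there (mixing bytes and str raises TypeError instead), so it can't be reproduced directly. As a rough sketch, the snippet below performs explicitly the ASCII decode that Python 2 attempts behind the scenes. The function and variable names are made up for illustration and are not taken from the workflow.

```python
# Explicitly perform the ASCII decode that Python 2 attempts implicitly
# when a bytestring meets a Unicode string. All names are illustrative.

ascii_arg = b"hello"                   # pure ASCII bytes
utf8_arg = "\u00c5".encode("utf-8")    # b'\xc3\x85', i.e. UTF-8 for "Å"


def implicit_decode(raw):
    """Mimic Python 2's implicit bytes -> Unicode coercion (ASCII codec)."""
    return raw.decode("ascii")


# ASCII input survives the coercion, which is why ASCII queries work.
ascii_result = implicit_decode(ascii_arg)

# Non-ASCII input fails at exactly the same step.
try:
    implicit_decode(utf8_arg)
except UnicodeDecodeError as err:
    failure = str(err)
else:
    failure = None

# Decoding explicitly with the correct codec avoids the problem.
explicit_result = utf8_arg.decode("utf-8")
```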

I haven't installed or run the code, but I think changing line 144 to query = sys.argv[2].decode('utf-8') will fix the problem.

The Golden Rule with text encoding/decoding in Python is to do it at IO boundaries (i.e. decode all incoming text, encode all outgoing text).
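A minimal sketch of that rule, written for Python 3 and not taken from the workflow in question: `handle_query` and the "query: " prefix are placeholders, and the only point is where the decode and encode calls sit.

```python
# Decode at the input boundary, work on text, encode at the output boundary.
# `handle_query` and its processing step are placeholders for illustration.


def handle_query(raw_query):
    text = raw_query.decode("utf-8")        # input boundary: bytes -> text
    result = "query: " + text.strip()       # internal work uses text only
    return result.encode("utf-8")           # output boundary: text -> bytes


reply = handle_query("  \u00c5 ".encode("utf-8"))
```

In Python 2 the same shape applies: sys.argv holds bytestrings, so the decode happens as soon as the argument is read, and the encode happens just before writing the result out.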

Seeing as you're writing a workflow in Python, might I suggest you have a look at the Alfred-Workflow library, which will take care of most of the encoding/decoding for you (and a lot of the other mindless crap, too)?

Edited September 5, 2015 by deanishe

deanishe · September 5, 2015

Addendum: Here's an intro to string encoding I wrote, with a focus on Alfred workflows (but more particularly Alfred-Workflow).

nicke5012 · September 6, 2015

Ah ha! Thanks for your response, deanishe! And thanks for looking into the code. The latest version isn't on GitHub yet, but in it I do decode the text at the boundaries, as you suggested.

Decoding sys.argv didn't solve the problem, though. With a bit of your help I was able to determine that the issue arises from the response from Wunderlist's API: it doesn't respond with UTF-8 as expected/documented. For example, for the troublesome request it comes back with:

[ {u'created_at': u'2015-09-04T15:42:32.668Z',
  u'id': 182850841,
  u'list_type': u'list',
  u'owner_type': u'user',
  u'public': False,
  u'title': u'a\xb4\xb4\xb4\xb4\xc5\xc5\xc5\xb4\xb4\xc5\xce\xce\xce\xcf\xcd\xd2\u02c6\u201e\xcf\u0131\xd4\xcd\xd3\xd4\xcd\xcf\u0131\xc5\xcd\xaf\xc5\xcd\xe5\xe5\u2202\xdf\u02da\u02dc\u2264\u221a\xe5\xdf',
  u'type': u'list'}]

Even if I take that title and treat it as a byte string, decoding it doesn't work. E.g.:

b'a\xb4\xb4\xb4\xb4\xc5\xc5\xc5\xb4\xb4\xc5\xce\xce\xce\xcf\xcd\xd2\u02c6\u201e\xcf\u0131\xd4\xcd\xd3\xd4\xcd\xcf\u0131\xc5\xcd\xaf\xc5\xcd\xe5\xe5\u2202\xdf\u02da\u02dc\u2264\u221a\xe5\xdf'.decode('utf-8')

I'll need to dig into it a bit more (if you have any suggestions they'd be much appreciated!) but just wanted to say thanks for the response. It did put me on the right track!

Edited September 6, 2015 by nicke5012

deanishe · September 6, 2015

The API response seems perfectly valid (it was properly decoded).

You can't decode this:

u'a\xb4\xb4\xb4\xb4\xc5\xc5\xc5\xb4\xb4\xc5\xce\xce\xce\xcf\xcd\xd2\u02c6\u201e\xcf\u0131\xd4\xcd\xd3\xd4\xcd\xcf\u0131\xc5\xcd\xaf\xc5\xcd\xe5\xe5\u2202\xdf\u02da\u02dc\u2264\u221a\xe5\xdf'

because it's a Unicode string, i.e. it's already been decoded. Try this instead:

# This is a Unicode string
>>> u = u'a\xb4\xb4\xb4\xb4\xc5\xc5\xc5\xb4\xb4\xc5\xce\xce\xce\xcf\xcd\xd2\u02c6\u201e\xcf\u0131\xd4\xcd\xd3\xd4\xcd\xcf\u0131\xc5\xcd\xaf\xc5\xcd\xe5\xe5\u2202\xdf\u02da\u02dc\u2264\u221a\xe5\xdf'
>>> b = u.encode('utf-8')
>>> b
# This is the UTF-8 string
'a\xc2\xb4\xc2\xb4\xc2\xb4\xc2\xb4\xc3\x85\xc3\x85\xc3\x85\xc2\xb4\xc2\xb4\xc3\x85\xc3\x8e\xc3\x8e\xc3\x8e\xc3\x8f\xc3\x8d\xc3\x92\xcb\x86\xe2\x80\x9e\xc3\x8f\xc4\xb1\xc3\x94\xc3\x8d\xc3\x93\xc3\x94\xc3\x8d\xc3\x8f\xc4\xb1\xc3\x85\xc3\x8d\xc2\xaf\xc3\x85\xc3\x8d\xc3\xa5\xc3\xa5\xe2\x88\x82\xc3\x9f\xcb\x9a\xcb\x9c\xe2\x89\xa4\xe2\x88\x9a\xc3\xa5\xc3\x9f'
>>> print(b)
a´´´´ÅÅÅ´´ÅÎÎÎÏÍÒˆ„ÏıÔÍÓÔÍÏıÅÍ¯ÅÍåå∂ß˚˜≤√åß

and it works.
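The same distinction is easier to see in Python 3, where the two types have different names. This is only a sketch using a shortened version of the title above; the variable names are illustrative.

```python
# Python 3 analogue: `str` is already-decoded text, `bytes` is encoded data.
title = "a\xb4\xc5\u02c6\u2202"        # shortened version of the title above

encoded = title.encode("utf-8")        # text -> UTF-8 bytes
round_trip = encoded.decode("utf-8")   # UTF-8 bytes -> text again

# Text has nothing left to decode: `str` has no .decode() method at all.
has_decode = hasattr(title, "decode")
```

Python 2's unicode type does have a .decode() method, but it first encodes the string with the ASCII codec, which is why calling it on already-decoded non-ASCII text fails.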

I can't say what the problem is without the latest version of the code you're actually running.

The root issue is probably that Alfred has an empty environment. If your shell environment is set up correctly, you can print Unicode strings and they will be encoded to UTF-8 on output:

# in a properly-configured shell
echo $LANG
en_GB.UTF-8
python
>>> print u'\xfcnic\xf8de'
ünicøde

# in an empty environment
env -i python
>>> print u'\xfcnic\xf8de'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
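One way to avoid depending on the environment is to encode explicitly and write bytes. The sketch below is Python 3 and uses a hypothetical helper, `write_utf8`; in a real script the stream would be sys.stdout.buffer, and an in-memory buffer stands in for it here so the example runs anywhere.

```python
import io


def write_utf8(text, binary_stream):
    """Encode explicitly instead of relying on LANG/LC_ALL being set."""
    binary_stream.write(text.encode("utf-8"))


# In a real script, pass sys.stdout.buffer; BytesIO is a stand-in here.
fake_stdout = io.BytesIO()
write_utf8("\xfcnic\xf8de\n", fake_stdout)
```

In Python 2 the equivalent is to encode before printing, e.g. print(u'\xfcnic\xf8de'.encode('utf-8')).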

Edited September 6, 2015 by deanishe

nicke5012 · September 9, 2015

Hi deanishe, ah, looks like you're correct: the API is returning UTF-8! Okay, so that's not the problem. I still can't get the extension to work with special characters, though, and would love some more of your help. I think there's still something I'm misunderstanding about how Alfred is encoding characters.

In the workflow's GitHub repo, I added a simple "test" workflow in a subdirectory to illustrate the issue. Ideally, with the test workflow, the Alfred command "test Å" should return "234": as you'll see from the workflow, I'm trying to access a dict that has a special character as a key. That doesn't work, although running the Python script from the command line via "python test Å" works just fine. Would love to know your thoughts here.

Also, I've pushed the latest code for the extension I'm working on to GitHub. I'm hitting the error at line 53.

Thanks for all your help!

deanishe · September 9, 2015

This issue is explained, along with the solution, in the article I posted earlier.

if __name__ == '__main__':
    query = sys.argv[1].decode('utf-8')
    sys.stderr.write('test_dict : {!r} query: {!r}\n'.format(test_dict, query))
    ...

produces in Alfred's debugger:

test_dict : {'b': 2, u'\xc5': 234} query: u'A\u030a'
Traceback (most recent call last):
  File "test.py", line 37, in <module>
    output_value = test_dict[query]
KeyError: u'A\u030a'

Obviously, u'\xc5' != u'A\u030a'.

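What's happening is you have two differently-normalised representations of the same character. u'\xc5' is just "Å" as a single code point, while u'A\u030a' is a "decomposed" form: a plain "A" followed by a combining ring above (U+030A). As the debugger output shows, the query arrives from Alfred in the latter form (NFD), which OS X commonly uses, while the key in your dict is in the former, composed form (NFC). Python itself doesn't normalise strings for you, so the two compare as unequal.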

The solution is to ensure your input is normalised to the same form as your data:

if __name__ == '__main__':
    from unicodedata import normalize
    query = normalize('NFC', sys.argv[1].decode('utf-8'))
    sys.stderr.write('test_dict : {!r} query: {!r}\n'.format(test_dict, query))

produces in the debugger:

 test_dict : {'b': 2, u'\xc5': 234} query: u'\xc5'

and your workflow works as expected.
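For reference, here is the same check as a self-contained Python 3 sketch. `test_dict` mirrors the one in the test workflow, while `lookup` is a hypothetical helper added for illustration.

```python
from unicodedata import normalize

composed = "\u00c5"       # "Å" as a single code point (NFC)
decomposed = "A\u030a"    # "A" + combining ring above (NFD)

test_dict = {"b": 2, composed: 234}

# The two forms render identically but are different strings.
forms_differ = composed != decomposed


def lookup(query):
    """Normalise the query to NFC before using it as a dict key."""
    return test_dict[normalize("NFC", query)]


value = lookup(decomposed)
```

Normalising to NFC is the right choice here only because the dict keys are NFC; what matters is that both sides use the same form.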

nicke5012 · September 9, 2015

That was it! Thanks Deanishe! Super super helpful all around.

And thanks for making the Alfred-Workflow module. I'll be looking to incorporate it into future workflows. This time around though I wanted to slog through it to figure out exactly what was going on--for example I'd never heard of unicode normalization before.

Thanks again!

deanishe · September 9, 2015

A very good idea, to be sure. Alfred-Workflow is very helpful if you just want to get stuff done quickly, but you won't learn too much about Alfred using it.

This other page from the Alfred-Workflow docs might be of interest to you (it explains how the various XML options work and interact).

deanishe · September 9, 2015

One more thing: if you're getting an error in a script/workflow, don't just say, "I'm getting an error". Post the code causing the error, the traceback you're getting (if there is one) and the input, output and expected output.

If you'd run sys.stderr.write('test_dict : {!r} query: {!r}\n'.format(test_dict, query)) at the start and posted the output, I could have told you 5 days ago what the problem was.

It would have saved us both a lot of time (you much more so than me).
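A small Python 3 sketch of that kind of debug line. `describe` and `debug` are hypothetical helpers; ascii() is used so that non-ASCII characters show up as escapes, much like the {!r} output does in Python 2.

```python
import sys


def describe(label, value):
    """Format a value with non-ASCII characters shown as escapes."""
    return "{}: {}".format(label, ascii(value))


def debug(label, value):
    """Write the description to stderr, where Alfred's debugger shows it."""
    sys.stderr.write(describe(label, value) + "\n")


composed_line = describe("key", "\u00c5")
decomposed_line = describe("query", "A\u030a")
debug("query", "A\u030a")
```

Printed this way, the difference between the composed key and the decomposed query is visible at a glance.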

Edited September 9, 2015 by deanishe
