Encoding problem with ARGV[0]

alfred_enthusiast · September 15, 2018

I'm creating a workflow and I'm having trouble handling input that contains non-ASCII characters.
If I set the input as `argv` and type "bär", the script receives a string of length 4 that is not equal to "bär".

Any suggestions on how I should deal with this string?

deanishe · September 15, 2018

Can't tell you how to deal with it without knowing what programming language you're using. Ruby?

In any case, the problem is that "bär" may only be three letters, but it is four bytes (or can be). You need to use a Unicode-aware language/functions to get "correct" lengths.

Here's a demonstration in Python 2:

>>> s = 'bär'
>>> len(s)
4
>>> s
'b\xc3\xa4r'
>>> u = unicode(s, 'utf-8')
>>> len(u)
3
>>> u
u'b\xe4r'
>>>

As you can see in the line 'b\xc3\xa4r', there are four bytes, b, \xc3, \xa4 and r in the UTF-8 string.

When I decode it to a Unicode string, len returns the correct number of characters.
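The same bytes-versus-characters distinction can be sketched in Ruby. This is only an illustration: the helper name `string_sizes` is made up for the example, and the word is written with a `\u` escape so that the result does not depend on the encoding of the source file.

```ruby
# Report how one string is counted: as codepoints, as bytes of its
# encoding, and as raw binary data (where every byte is one "character").
def string_sizes(word)
  {
    characters: word.length,
    bytes: word.bytesize,
    hex: word.bytes.map { |b| format('%02x', b) }.join(' '),
    as_binary: word.dup.force_encoding(Encoding::BINARY).length
  }
end

# "bär" with a precomposed ä (U+00E4), stored as UTF-8.
sizes = string_sizes("b\u00E4r")
puts sizes[:characters]  # 3
puts sizes[:bytes]       # 4
puts sizes[:hex]         # 62 c3 a4 72
puts sizes[:as_binary]   # 4
```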

Edited September 15, 2018 by deanishe

alfred_enthusiast · September 15, 2018

Thanks, I'm using Ruby.
In particular, I'm using

ruby 2.6.0dev (2018-06-22 trunk 63723) [x86_64-darwin17]

I see that the length of the input string is 4 for "bär" but if I type

"bär".length

The result is 3

Forgot to mention: if I run the script from the command line, like

./mzScript.rb bär

the result is exactly what I expect.

Edited September 15, 2018 by alfred_enthusiast

deanishe · September 15, 2018

17 minutes ago, alfred_enthusiast said:

I see that the length of the input string is 4 for "bär" but if I type

That's because Alfred (or more precisely NSTask) uses a decomposed form of Unicode. In Python again:

>>> from unicodedata import normalize
>>> s = u'bär'
>>> s
u'b\xe4r'
>>> len(s)
3
>>> s2 = normalize('NFD', s)
>>> s2
u'ba\u0308r'
>>> len(s2)
4

As you can see, s contains three codepoints: b, \xe4 (ä) and r, but s2 contains four codepoints, b, a, \u0308 and r.

\u0308 is the COMBINING DIAERESIS character. "Diaeresis" is the proper English name for "umlaut" and "combining" means "add it to the previous codepoint", i.e. "put an umlaut on the 'a'."

In this particular situation, you need to normalise the string to form NFC. That turns s2 into s.
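In Ruby, the round trip between the two forms might look like the sketch below. It uses `String#unicode_normalize` from the standard library. The constant names and the `nfc` helper are just for illustration.

```ruby
COMPOSED   = "b\u00E4r"    # ä as a single codepoint (form NFC)
DECOMPOSED = "ba\u0308r"   # a followed by COMBINING DIAERESIS (form NFD)

# Normalise any string to the composed form.
def nfc(text)
  text.unicode_normalize(:nfc)
end

puts COMPOSED == DECOMPOSED        # false: same rendered text, different codepoints
puts DECOMPOSED.length             # 4
puts nfc(DECOMPOSED) == COMPOSED   # true
puts nfc(DECOMPOSED).length        # 3
```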

That will give you the result you expect in this case, but fundamentally, neither the length of a string (Unicode or encoded) nor the number of codepoints in a Unicode string necessarily corresponds to the number of characters in the rendered text.
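To illustrate that last point, here is a small Ruby sketch comparing three ways of counting. It assumes Ruby 2.5 or later for `String#grapheme_clusters`, and the `counts` helper is hypothetical. The second example uses "q" with a diaeresis, which has no precomposed codepoint, so even NFC leaves it as two codepoints that render as one character.

```ruby
# Count a string three ways: encoded bytes, codepoints, and
# grapheme clusters (roughly, characters as a reader sees them).
def counts(text)
  {
    bytes: text.bytesize,
    codepoints: text.length,
    graphemes: text.grapheme_clusters.length
  }
end

decomposed = counts("ba\u0308r")
puts decomposed[:bytes]       # 5
puts decomposed[:codepoints]  # 4
puts decomposed[:graphemes]   # 3

# NFC cannot compose q + COMBINING DIAERESIS into a single codepoint.
q_umlaut = counts("q\u0308".unicode_normalize(:nfc))
puts q_umlaut[:codepoints]    # 2
puts q_umlaut[:graphemes]     # 1
```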

Edited September 15, 2018 by deanishe

alfred_enthusiast · September 15, 2018

you're my hero

I spent a day trying to figure out how the two strings could be different.

To make it work, I only had to normalise the input:

# from this
input = ARGV[0].downcase

# to this
input = ARGV[0].downcase.unicode_normalize
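A slightly more defensive variant of the same idea might look like the sketch below. The method name `normalized_query` is made up for the example. Forcing the encoding to UTF-8 and scrubbing invalid bytes are assumptions about what the script may receive (for instance, when Ruby tags the argument with a non-UTF-8 locale encoding), not something Alfred is known to require.

```ruby
# Read the first argument, treat its bytes as UTF-8, compose it to NFC
# and lowercase it. A missing argument yields an empty string.
def normalized_query(argv = ARGV)
  raw = argv.first.to_s.dup
  raw.force_encoding(Encoding::UTF_8)         # assumption: the bytes are UTF-8
  raw = raw.scrub unless raw.valid_encoding?  # replace any invalid bytes
  raw.unicode_normalize(:nfc).downcase
end

input = normalized_query
```

With this in place, a decomposed "BÄR" from Alfred and a composed "bär" from the command line both end up as the same three-codepoint string.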

I can now finally complete my workflow and share it on GitHub. I'll mention this post, of course.

[Screenshot attached: ScreenShot 2018-09-15 at 17.36.49.png]

deanishe · September 15, 2018

Ooo. That looks useful. Be sure to post the workflow on the forum. I want it.

alfred_enthusiast · September 15, 2018

As promised, here is the workflow. I'll post it in the other section.
https://github.com/ignazioc/DerDieDas

deanishe · September 15, 2018

Ah, it's dict.cc. Which dictionary does the workflow expect? EN -> DE or DE -> EN? Or is either okay?

You should probably add a download link to the README on GitHub for people who don't find the workflow via the forum. Perhaps create a GitHub release, which is where most people expect to find downloads.

deanishe · September 15, 2018

You need to rebuild the workflow, I think.

I just installed it and it doesn't work. In dict_cc.rb it says FILENAME = 'cmodbnkkcf-52898166-ea5ea5.txt'.freeze instead of FILENAME = 'full_dictionary.txt'.freeze

One tip: It would be a good idea to create the indices in the workflow's cache or data folder, not in the workflow folder itself. With the indices, the workflow is over 100MB, which is not very sync-friendly (standard Dropbox is only 2GB and 5% of that for one workflow is not ideal).
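As a rough sketch of that tip in Ruby: Alfred passes the workflow's cache and data folders to scripts in the `alfred_workflow_cache` and `alfred_workflow_data` environment variables. The `index_dir` helper, the `indices` subfolder and the fallback directory name below are all hypothetical, chosen only for the example.

```ruby
require 'fileutils'
require 'tmpdir'

# Return a directory for generated index files, creating it if needed.
# Falls back to a temp folder when not run from Alfred (e.g. in a terminal).
def index_dir(env = ENV)
  base = env['alfred_workflow_cache'] ||
         File.join(Dir.tmpdir, 'workflow-cache')  # hypothetical fallback
  dir = File.join(base, 'indices')
  FileUtils.mkdir_p(dir)  # the folder may not exist yet
  dir
end

puts index_dir
```

The cache folder suits files that can be regenerated at any time, such as indices built from the dictionary file. The data folder is the better fit for anything that should survive a cache clean-up.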

alfred_enthusiast · September 15, 2018

@deanishe Thanks for all your suggestions. Definitely very helpful.

Two questions:
1. Why did you mention Dropbox? Is it a common approach to sync workflows over Dropbox?
2. I would like to replace the indexes with a real database (SQLite), but for that I need some specific Ruby gems. How can I execute an "installation script" when the workflow is installed?

deanishe · September 15, 2018

Dropbox is the standard method of syncing workflows and settings between machines. Other sync services don’t work very well.

You don’t run an installation script. You bundle the gems with the workflow.
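One way to do that, sketched under assumptions: copy each gem into a folder inside the workflow and add its `lib` directory to Ruby's load path before calling `require`. The `vendor/gems/<gem>/lib` layout and the `add_bundled_gems` helper are hypothetical, not something Alfred prescribes. Gems with native extensions, such as sqlite3, also need their compiled parts to match the Ruby that runs the workflow, which this sketch does not handle.

```ruby
# Add every bundled gem's lib directory under root/vendor/gems to the
# load path, and return the directories that were found.
def add_bundled_gems(root)
  lib_dirs = Dir.glob(File.join(root, 'vendor', 'gems', '*', 'lib')).sort
  lib_dirs.each do |dir|
    $LOAD_PATH.unshift(dir) unless $LOAD_PATH.include?(dir)
  end
  lib_dirs
end

# Call this at the top of the workflow script, before requiring bundled gems.
add_bundled_gems(__dir__ || Dir.pwd)
```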
