
Encoding problem with ARGV[0]



Can't tell you how to deal with it without knowing what programming language you're using. Ruby?


In any case, the problem is that "bär" may only be three letters, but it is (or can be) four bytes: in UTF-8, the ä takes two bytes. You need to use a Unicode-aware language or Unicode-aware functions to get "correct" lengths.


Here's a demonstration in Python 2:

 

>>> s = 'bär'
>>> len(s)
4
>>> s
'b\xc3\xa4r'
>>> u = unicode(s, 'utf-8')
>>> len(u)
3
>>> u
u'b\xe4r'
>>>

 

As you can see in the line 'b\xc3\xa4r', there are four bytes, b, \xc3, \xa4 and r in the UTF-8 string.

 

When I decode it to a Unicode string, len returns the correct number of characters.
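In case the language in question is Ruby, here is a rough sketch of the same bytes-versus-characters distinction. The helper name `byte_hex` is made up for illustration, and the string is written with a `\u` escape so that it is UTF-8 regardless of the source file's encoding.

```ruby
# Hypothetical helper: show the raw bytes of a string as hex pairs.
def byte_hex(text)
  text.bytes.map { |byte| format('%02x', byte) }.join(' ')
end

word = "b\u00e4r" # "bär" with a precomposed ä (U+00E4)

puts word.length    # 3 characters, because Ruby knows the string is UTF-8
puts word.bytesize  # 4 bytes
puts byte_hex(word) # 62 c3 a4 72
puts word.b.length  # 4: the same bytes treated as binary data, one "character" per byte
```

Ruby strings carry their encoding with them, so `length` counts characters only when the string is tagged with the right encoding.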

 

Edited by deanishe

Thanks, I'm using Ruby.
In particular, I'm using
 

ruby 2.6.0dev (2018-06-22 trunk 63723) [x86_64-darwin17]

I see that the length of the input string is 4 for "bär" but if I type

 

"bär".length

The result is 3 :(

Forgot to mention: if I run the script from the command line, like


 

./mzScript.rb bär

the result is exactly what I expect.

 

 


Edited by alfred_enthusiast
17 minutes ago, alfred_enthusiast said:

I see that the length of the input string is 4 for "bär" but if I type

 

That's because Alfred (or more precisely NSTask) passes the argument in a decomposed form of Unicode. In Python again:

 

>>> from unicodedata import normalize
>>> s = u'bär'
>>> s
u'b\xe4r'
>>> len(s)
3
>>> s2 = normalize('NFD', s)
>>> s2
u'ba\u0308r'
>>> len(s2)
4

 

As you can see, s contains three codepoints: b, \xe4 (ä) and r, but s2 contains four codepoints, b, a, \u0308 and r.


\u0308 is the COMBINING DIAERESIS character. "Diaeresis" is the name Unicode uses for the two-dot mark that also serves as the umlaut, and "combining" means "add it to the previous codepoint", i.e. "put an umlaut on the 'a'."

 

In this particular situation, you need to normalise the string to form NFC. That turns s2 into s.
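In Ruby, that normalisation could look roughly like the sketch below. `nfc_query` is a hypothetical helper name, and the code assumes the bytes in `ARGV[0]` are UTF-8, which is why it tags them as such before normalising. `String#unicode_normalize` is part of Ruby's standard library from 2.2 onwards.

```ruby
# Hypothetical helper: return the argument re-composed into NFC form.
def nfc_query(raw)
  # Assumption: the bytes are UTF-8. force_encoding only re-tags the
  # string, it does not transcode, so work on a copy.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  text.unicode_normalize(:nfc)
end

decomposed = "ba\u0308r" # b, a, COMBINING DIAERESIS, r
composed = nfc_query(decomposed)

puts decomposed.length         # 4
puts composed.length           # 3
puts composed == "b\u00e4r"    # true

# In the workflow script itself, something like:
query = nfc_query(ARGV[0] || '')
```

Normalising once, right where the argument enters the script, keeps every later comparison and length check working on the same form.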


That will give you the result you expect in this case, but fundamentally neither the length of an encoded string nor the number of codepoints in a Unicode string is guaranteed to match the number of characters in the rendered text. Some letter-plus-combining-mark pairs have no precomposed form, so even NFC leaves them as several codepoints.
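To illustrate that last point in Ruby: a minimal sketch comparing bytes, codepoints and grapheme clusters (roughly, user-perceived characters). `counts` is a made-up helper, and `String#grapheme_clusters` needs Ruby 2.5 or later.

```ruby
# Hypothetical helper: three different answers to "how long is this string?"
def counts(text)
  {
    bytes: text.bytesize,
    codepoints: text.length,
    graphemes: text.grapheme_clusters.length
  }
end

decomposed = "ba\u0308r"                           # NFD form of "bär"
no_precomposed = "q\u0308".unicode_normalize(:nfc) # q + diaeresis has no single codepoint

p counts(decomposed)     # bytes 5, codepoints 4, graphemes 3
p counts(no_precomposed) # bytes 3, codepoints 2, graphemes 1
```

Counting grapheme clusters gets closer to what a reader sees on screen, though it is still not a guarantee about rendering.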

Edited by deanishe

Ah, it's dict.cc. Which dictionary does the workflow expect? EN -> DE or DE -> EN? Or is either okay?

 

You should probably add a download link to the README on GitHub for people who don't find the workflow via the forum. Perhaps create a GitHub release, which is where most people expect to find downloads.


You need to rebuild the workflow, I think.

 

I just installed it and it doesn't work. In dict_cc.rb it says FILENAME = 'cmodbnkkcf-52898166-ea5ea5.txt'.freeze instead of FILENAME = 'full_dictionary.txt'.freeze

 

One tip: It would be a good idea to create the indices in the workflow's cache or data folder, not in the workflow folder itself. With the indices, the workflow is over 100MB, which is not very sync-friendly (standard Dropbox is only 2GB and 5% of that for one workflow is not ideal).
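A rough sketch of what that could look like in Ruby. As far as I know, Alfred exposes the cache folder to scripts in the `alfred_workflow_cache` environment variable (and the data folder in `alfred_workflow_data`), but the folder may not exist yet, so the script has to create it. The helper name `index_dir`, the `indices` subfolder and the temp-dir fallback are all made up for illustration and are not taken from the workflow.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: return (and create if needed) a folder for the indices
# that lives outside the workflow folder.
def index_dir(env = ENV)
  # Assumption: Alfred sets alfred_workflow_cache when it runs the script.
  # Fall back to a temp folder so the script still runs from a terminal.
  base = env.fetch('alfred_workflow_cache') do
    File.join(Dir.tmpdir, 'dict-cc-indices')
  end
  dir = File.join(base, 'indices')
  FileUtils.mkdir_p(dir) # no error if the folder already exists
  dir
end
```

The script would then build its index files under `index_dir` instead of next to `dict_cc.rb`, which keeps the synced workflow folder small.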

 

