alfred_enthusiast Posted September 15, 2018

I'm creating a workflow and I'm having trouble handling input that contains non-ASCII characters. If I set the input as `argv` and type "bär", the script receives a string of length 4 that is not equal to "bär". Any suggestions on how I should deal with this string?
deanishe Posted September 15, 2018 (edited)

Can't tell you how to deal with it without knowing what programming language you're using. Ruby?

In any case, the problem is that "bär" may only be three letters, but it is four bytes (or can be). You need to use a Unicode-aware language/functions to get "correct" lengths.

Here's a demonstration in Python (Python 2, where a plain string literal is a byte string):

    >>> s = 'bär'
    >>> len(s)
    4
    >>> s
    'b\xc3\xa4r'
    >>> u = unicode(s, 'utf-8')
    >>> len(u)
    3
    >>> u
    u'b\xe4r'
    >>>

As you can see in the line 'b\xc3\xa4r', there are four bytes, b, \xc3, \xa4 and r, in the UTF-8 string. When I decode it to a Unicode string, len returns the correct number of characters.

Edited September 15, 2018 by deanishe
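The same byte-versus-character distinction can be sketched in Ruby. This is only an illustration, assuming a UTF-8 string; the variable names are made up for the example.

```ruby
# Characters vs. bytes for the precomposed spelling of "bär".
# The \u escape guarantees a UTF-8 string whatever the source encoding is.
s = "b\u00E4r"             # U+00E4 is the single codepoint for "ä"

char_count = s.length       # number of codepoints in the string
byte_count = s.bytesize     # number of bytes in the UTF-8 encoding
utf8_bytes = s.bytes.map { |b| format('%02x', b) }

puts char_count             # 3
puts byte_count             # 4
puts utf8_bytes.join(' ')   # 62 c3 a4 72
```

`String#length` counts codepoints while `String#bytesize` counts encoded bytes, which is the same split as `len(u)` versus `len(s)` in the Python session.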
alfred_enthusiast Posted September 15, 2018 Author (edited)

Thanks, I'm using Ruby. Specifically, I'm using ruby 2.6.0dev (2018-06-22 trunk 63723) [x86_64-darwin17].

I see that the length of the input string is 4 for "bär", but if I type

    "bär".length

the result is 3.

Forgot to mention: if I run the script from the command line, like

    ./mzScript.rb bär

the result is exactly what I expect.

Edited September 15, 2018 by alfred_enthusiast
deanishe Posted September 15, 2018 (edited)

17 minutes ago, alfred_enthusiast said: I see that the length of the input string is 4 for "bär" but if I type

That's because Alfred (or more precisely NSTask) uses a decomposed form of Unicode. In Python again:

    >>> from unicodedata import normalize
    >>> s = u'bär'
    >>> s
    u'b\xe4r'
    >>> len(s)
    3
    >>> s2 = normalize('NFD', s)
    >>> s2
    u'ba\u0308r'
    >>> len(s2)
    4

As you can see, s contains three codepoints: b, \xe4 (ä) and r, but s2 contains four codepoints: b, a, \u0308 and r. \u0308 is the COMBINING DIAERESIS character. "Diaeresis" is the proper English name for "umlaut" and "combining" means "add it to the previous codepoint", i.e. "put an umlaut on the 'a'."

In this particular situation, you need to normalise the string to form NFC. That turns s2 into s.

That will give you the result you expect in this case, but fundamentally neither the length of a string (Unicode or encoded) nor the number of codepoints in a Unicode string corresponds to the number of characters in the rendered text.

Edited September 15, 2018 by deanishe (Add links)
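For a Ruby view of the NFD/NFC difference described above, here is a minimal sketch. It assumes Ruby 2.5 or later (for `String#grapheme_clusters`), and the variable names are only illustrative.

```ruby
# Precomposed (NFC) and decomposed (NFD) spellings of the same word.
nfc = "b\u00E4r"                       # b, ä, r
nfd = nfc.unicode_normalize(:nfd)      # b, a, U+0308 (combining diaeresis), r

nfc_len = nfc.length                   # codepoints in the NFC form
nfd_len = nfd.length                   # codepoints in the NFD form
equal_raw = (nfc == nfd)               # compared codepoint by codepoint
equal_normalised = (nfc == nfd.unicode_normalize(:nfc))
clusters = nfd.grapheme_clusters.length  # user-perceived characters

puts nfc_len            # 3
puts nfd_len            # 4
puts equal_raw          # false
puts equal_normalised   # true
puts clusters           # 3
```

Counting grapheme clusters gets closer to "characters as rendered" than counting codepoints does, although normalising both sides to the same form is still what makes string comparison work.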
alfred_enthusiast Posted September 15, 2018 Author

You're my hero. I spent a day trying to figure out how the two strings could be different. To make it work, I only had to normalise the input:

    # from this
    input = ARGV[0].downcase
    # to this
    input = ARGV[0].downcase.unicode_normalize

I can now finally complete my workflow and share it on GitHub. I'll mention this post, of course.
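A slightly more defensive variant of that one-liner might look like the sketch below. `normalize_query` is a hypothetical helper, not part of the workflow, and it assumes the argument is meant to be UTF-8 text.

```ruby
# Hypothetical helper: clean up the query string a workflow receives.
# Assumption: the raw argument is UTF-8 text, possibly in decomposed form.
def normalize_query(raw)
  return '' if raw.nil?                 # no argument was passed

  raw.dup                               # work on a copy, never the original
     .force_encoding(Encoding::UTF_8)   # treat the bytes as UTF-8
     .scrub                             # replace any invalid byte sequences
     .unicode_normalize(:nfc)           # compose "a" + U+0308 into "ä"
     .downcase
end

input = normalize_query(ARGV[0])
```

`unicode_normalize` defaults to NFC, so passing `:nfc` only makes the intent explicit. Normalising before downcasing is a choice made in this sketch; the post above does it the other way round and reports that it works.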
deanishe Posted September 15, 2018

Ooo. That looks useful. Be sure to post the workflow on the forum. I want it.
alfred_enthusiast Posted September 15, 2018 Author

As promised, this is the workflow. I'll also post it in the other section. https://github.com/ignazioc/DerDieDas
deanishe Posted September 15, 2018

Ah, it's dict.cc. Which dictionary does the workflow expect? EN -> DE or DE -> EN? Or is either okay?

You should probably add a download link to the README on GitHub for people who don't find the workflow via the forum. Perhaps create a GitHub release, which is where most people expect to find downloads.
deanishe Posted September 15, 2018

You need to rebuild the workflow, I think. I just installed it and it doesn't work. In dict_cc.rb it says

    FILENAME = 'cmodbnkkcf-52898166-ea5ea5.txt'.freeze

instead of

    FILENAME = 'full_dictionary.txt'.freeze

One tip: It would be a good idea to create the indices in the workflow's cache or data folder, not in the workflow folder itself. With the indices, the workflow is over 100MB, which is not very sync-friendly (standard Dropbox is only 2GB and 5% of that for one workflow is not ideal).
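One way to follow the cache-folder tip is sketched below. It assumes Alfred exposes the cache location to the script in the `alfred_workflow_cache` environment variable and that the script must create the directory itself; `index_dir`, the `indices` subfolder and the temporary-directory fallback are hypothetical names chosen for illustration.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: pick a directory for generated index files so they
# live outside the (synced) workflow folder.
# Assumption: Alfred sets `alfred_workflow_cache` when it runs the script.
def index_dir(env = ENV)
  # Fall back to a temp directory when run outside Alfred (e.g. in a shell).
  base = env['alfred_workflow_cache'] ||
         File.join(Dir.tmpdir, 'workflow-cache')
  dir = File.join(base, 'indices')
  FileUtils.mkdir_p(dir)   # create the directory if it does not exist yet
  dir
end

puts index_dir if $PROGRAM_NAME == __FILE__
```

Passing the environment in as a parameter keeps the helper easy to try from a terminal, where the Alfred variable is not set.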
alfred_enthusiast Posted September 15, 2018 Author

@deanishe Thanks for all your suggestions. Definitely very helpful. Two questions:

1. Why did you mention Dropbox? Is it a common approach to sync workflows over Dropbox?
2. I would like to replace the indexes with a real database (SQLite), but for that I need some specific Ruby gems. How can I execute an "installation script" when the workflow is installed?
deanishe Posted September 15, 2018

Dropbox is the standard method of syncing workflows and settings between machines. Other sync services don’t work very well.

You don’t run an installation script. You bundle the gems with the workflow.
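As a rough sketch of what bundling can look like on the Ruby side: the script adds the shipped gems' `lib` folders to the load path before requiring them. The `vendor/gems/<name>-<version>/lib` layout and the `add_bundled_gems` helper are assumptions for illustration, and gems with native extensions may need more than this.

```ruby
# Hypothetical helper: make gems shipped inside the workflow folder loadable.
# Assumed layout: <vendor_root>/gems/<name>-<version>/lib/...
def add_bundled_gems(vendor_root)
  lib_dirs = Dir.glob(File.join(vendor_root, 'gems', '*', 'lib')).sort
  lib_dirs.each do |dir|
    # Put bundled gems ahead of anything installed system-wide.
    $LOAD_PATH.unshift(dir) unless $LOAD_PATH.include?(dir)
  end
  lib_dirs   # return what was added, which helps when debugging
end

# Typical use at the top of the workflow script (path is an assumption):
add_bundled_gems(File.join(__dir__ || Dir.pwd, 'vendor'))
```

After this call, a plain `require` of a bundled gem resolves against the workflow's own copy, so nothing has to be installed on the user's machine.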