First Workflow: Need Ideas, Suggestions, Examples, Documentation


Recommended Posts

First off, I am new to programming. My interest in such things was sparked by the discovery of Alfred, and the Skimmer workflow, which I use daily. This project began as a simple format change of the Evernote export Skimmer offers, to suit outlines. I’ve looked at code from some workflows here, and I’m not fooling myself: in many ways what I have is still simple and still very much based on Skimmer.

 
What I have thus far is a functioning AppleScript which builds off of the Skim export to include exporting to OmniOutliner as well as Evernote. I take my school PowerPoint lectures, convert them to PDF, and mark them up with annotations. When I’m finished I export with options specific to the lecture PDF. For example, I’ve included image extraction, text correction, word frequency, etc.
 
 
 
 
Short Term Goals:
 
1. Move the AppleScript into an Alfred workflow where I can pass option selections directly.
 
Examples:
 
export -oo = Export to OmniOutliner
export -en = Export to Evernote
export -oo -i = Export to OmniOutliner & Extract Images
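A sketch of how flags like those above might be parsed in Python with argparse (the flag names come from the examples; everything else here is hypothetical, not the workflow's actual code):

```python
import argparse

# Hypothetical parser for the export flags shown above (a sketch only)
parser = argparse.ArgumentParser(prog='export')
parser.add_argument('-oo', action='store_true', help='export to OmniOutliner')
parser.add_argument('-en', action='store_true', help='export to Evernote')
parser.add_argument('-i', action='store_true', help='extract images')

# "export -oo -i"
args = parser.parse_args(['-oo', '-i'])
print(args.oo, args.en, args.i)  # True False True
```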
 
 
2. Shorten / Clean up the Applescript
 
I can’t help feeling that there is a better way to accomplish some of the ideas in the script. The problem is that I’m blind to these areas: I don’t know when there is a more optimal way to do something.
 
 
3. Evernote Images
 
I’m having difficulty adding images sequentially in Evernote. It seems that I’m only able to append to a note if the append directly follows note creation. This makes it nearly impossible to loop through the annotations adding them in sequence. Instead, I have to make a list of images (grabbed from the boundaries of a Skim box note), create the note with HTML, and then loop through the list of images, adding them all at the end of the note. I found a post which describes this limitation. The relevant line is 373 in skim-2-oo-n-en.scpt (repository). Does anyone know a way around this?
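As a rough sketch of the workaround described above (collect the image paths first, then reference them all at the end of the note's HTML body; the paths here are hypothetical, and the actual note creation happens through Evernote's scripting interface):

```python
# Build the note body once, appending every image reference at the end.
# Paths are hypothetical; in the workflow they would come from Skim box notes.
image_paths = ['/tmp/page03-fig.png', '/tmp/page07-fig.png']

body = '<div>Lecture outline text</div>'
body += ''.join('<img src="file://{0}"/>'.format(p) for p in image_paths)

print(body)
```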
 
 
 
Long Term Goals:
 
1. Speed up the option “Find Spaces”
 
Some PDFs, converted from .pptx, have mangled text. In some instances, when text is copied and pasted from the PDF, all the spaces are removed. I’ve tried many different ways of converting the .pptx (Mac & PC PowerPoint, online converters, Office Online, etc.), but the issue persists. I’ve implemented this Python code in AppleScript with a few additions, but I imagine it would be faster if I had the .py in its own file and passed arguments to it from the script. Is there a simple example, using Alfred, that I can study?
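One minimal pattern for this (a sketch; the file name, placeholder function, and AppleScript line are all hypothetical) is to give the Python code its own file that reads its argument from the command line:

```python
# infer_spaces.py -- hypothetical standalone script that AppleScript
# could call with:
#   do shell script "/usr/bin/python infer_spaces.py " & quoted form of theText
import sys

def infer_spaces(text):
    # Placeholder for the real word-segmentation algorithm
    return text

if __name__ == '__main__':
    if len(sys.argv) > 1:
        # Whatever is printed here comes back as the AppleScript result
        print(infer_spaces(sys.argv[1]))
```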
 
The whole process is dependent on a word list sorted by frequency. Without factoring in computer specs, the speed is a function of the number of words in the list and the amount of annotated text, and the quality depends on the type of words. I’ve pieced together a medical word list using Corpora, making individual searches of nouns, verbs, etc. filtered by Medical, Speech, and Academics. The results output a list with each word and its overall frequency. I combined all the searches in a spreadsheet, sorted by frequency, and then removed duplicates. This list is good for PDFs with medical terms, but not so good if the text is not medical. For those I used the extensive lists here.
 
I would like to call up Alfred, type export -oo -fs, and then be able to select which list I want to use from the Alfred dropdown.
 
Also, the .py function breaks the string if numbers are contained in the annotation. Every character after a number is separated by a space (There are 7 d e a d l y s i n s). I need to figure out a way to skip over numbers and resume after the number.
 
 
2. Find a new way to determine word frequency in the PDF
 
This frequency is not related to the above. This is the top 50 words contained in the entire PDF presentation; I use it to get the gist of the lecture. Currently I’m using AppleScript to accomplish this, but even though my journey began with AppleScript, I’m finding it less appealing every day: good for some things, but unnecessary for most. I would like to use something different, preferably Python, because in my uneducated state I seem to think it is intuitive. However, I still need to incorporate a list of words to ignore, so that I don’t get “The word ‘the’ appeared 147 times in document X.”
 
 
3. Change export format via Alfred
 
Similar to my short term goal, yet different, and not needed immediately.
 
I’ve noticed many workflows allow the user to set default options by doing something like export -d, which brings up a different menu. In this menu I could see Define Default Font or Define Default Font Size, then select my option and type in the font I’d like to use.
 
 
 
I’m looking for any ideas, suggestions, examples, documentation, or forums for Python similar to macscripter.net. Alfred has really changed the way I work and study, and I’m still surprised more people don’t use it or know about it.
Edited by DrLulz

Ok. There's obviously a lot going on here, but I think it's all right-headed. First, I would recommend you stay with your inclination and start dipping into Python. Once I moved from AppleScript to Python, I never looked back. If I ever need AppleScript-specific functionality, I can invoke it directly from within a Python script. So, big vote for Python.

Along the same lines, I think building things for and with Alfred is a great way to start, so I would also encourage that intuition. Alfred has a number of limitations, which keep things fairly well within scope, but it's also incredibly flexible, so you can do crazy things (my own workflows are all over the map in terms of functionality and how they use Alfred's interface). I really think you should start by reading the documentation for the Alfred-Workflow Python library. It's a great introduction to writing workflows for Alfred in Python, and that library is (IMHO) the best for integrating a scripting language with Alfred. Then I'd also read through its source code on GitHub to start getting a feel for how Python is written, especially for Alfred. Dean, who wrote the library, is a very gifted Pythonista, and his code is an excellent example of Python: very clear, very direct, very clean.

Next, start topics in this forum for specific questions. The people here are here frequently enough and like to help. And having a specific question with a specific goal will likely get some good responses. I've learned so much from the people on this forum.

Finally, on general advice, I'd remind you to keep refactoring. You will learn new things all the time; go back and put them into practice in a code base you know well.


So, that's general advice; here are some specific thoughts on your questions.

First, I answered the Evernote images question in your other thread. The key is HTML image links with URLs to local files.

The infer_spaces() algorithm is a bit much to take in on first read, but it looks pretty efficient as it is. As for passing arguments to a Python script from AppleScript and getting the results back, check out my Wikify workflow, and specifically en2md.scpt, which calls out to a Python script. Passing over numbers should be fairly simple: add a clause to the backtracking loop that passes digit characters straight through (note that the characters of a Python string are strings themselves, so test them with str.isdigit() rather than isinstance(..., int)). This would likely go best here in the Python code linked above:

# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i > 0:
    # Pass digit characters straight through so numbers aren't split up
    if s[i-1].isdigit():
        out.append(s[i-1])
        i -= 1
        continue
    c, k = best_match(i)
    assert c == cost[i]
    out.append(s[i-k:i])
    i -= k
Finally, for word frequency of texts, you should really check out the NLTK Python package. For some idea of how to get word frequency from a text using that Python package, see section "3.1 Frequency Distributions" here.
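As a minimal illustration of the idea (using only the standard library rather than NLTK itself; the stop-word list here is a tiny stand-in for a real one):

```python
import re
from collections import Counter

# Stand-in stop-word list; NLTK ships a much fuller one
STOPWORDS = {'the', 'a', 'an', 'of', 'and', 'in', 'is', 'to', 'that'}

def top_words(text, n=50):
    """Return the n most common words, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

result = top_words('The pancreas makes insulin and insulin regulates glucose', 2)
print(result)  # first entry: ('insulin', 2)
```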

Those are my initial responses. Please do respond with any further questions...

stephen


I've been doing some reading. I didn't want to reply without a specific question, and I'm not sure if I should start a new thread for this, so I'll just post it here until someone tells me otherwise. The Wikify workflow was a big help, and I was successful in getting crosstalk working between AppleScript and Python.

 

 

This workaround is laughable. Please, someone point me in the right direction.

 
I am attempting to call the Alfred window from AppleScript, make a selection, and then return the result. The idea is described in this post.
 
I’ve done the following.
 
tell application "Alfred 2" to search "choose list"
 
The keyword “choose list” is a Script Filter with an XML list.
 
The result/query is passed to a Run Script action, which modifies a property of a second script, proxy.scpt.
 
Back in the AppleScript, directly after search “choose list”, I retrieve the modified property of proxy.scpt. To ensure the returned property was the result of Script Filter —> Run Script, I get the modified time of proxy.scpt and repeat until it's greater than the start time, exiting after x many seconds.
 
set word_list to my get_wrd_list()

on get_wrd_list()
    set proxy_path to quoted form of ((do shell script "pwd") & "/proxy.scpt")
    set time_start to do shell script "date +%s"
    set time_mod to do shell script "stat -f %m " & proxy_path

    repeat until time_mod > time_start
        set time_mod to do shell script "stat -f %m " & proxy_path
        set time_exit to do shell script "date +%s"
        if time_exit - time_start > 10 then exit repeat
    end repeat

    set proxy_value to load script ((POSIX file ((do shell script "pwd") & "/proxy.scpt")))
    return proxy_value's word_list 
end get_wrd_list
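The same polling idea, expressed in Python (a sketch of the workaround as described, not an endorsement of the approach; the function name is mine):

```python
import os
import time

def wait_for_update(path, timeout=10):
    """Poll `path` until its modification time changes.

    Return True as soon as the file is modified, or False once
    `timeout` seconds have passed without a change.
    """
    start = time.time()
    baseline = os.path.getmtime(path)
    while time.time() - start < timeout:
        if os.path.getmtime(path) > baseline:
            return True
        time.sleep(0.2)
    return False
```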
 
Like I said: laughable, and borderline ridiculous if not completely so.
 
I’d like to do this in Python. I’ve been looking at the Alfred-Workflow documentation, which is extensive and yet tailor-made for someone starting out. I’m slowly becoming more proficient, and currently understand 25% of what I read (up from 10% last month), but some of the subtle nuances are lost on me.
 
 
How should I go about this?
 
Edited by DrLulz

I'm not sure what you're getting at here either. Why do you need to modify proxy.scpt? Three things on this:

 

(1) Keep your code and your data separate. If you are modifying a script to be run, then you should consider it data. Keep the original (template) in the workflow directory, and then all modified copies should be in either the data or the cache directory.

 

(2) If you're using Python in other places, then you should try to stick with it as much as possible. But if you want to use a "Choose From List" AppleScript function, then you could also call that from Python. Basically, the "choose from list" AppleScript is just a string that you can invoke via the shell with osascript, so you can use Python to construct the string and run it as Python would run any shell script.

 

(3) AppleScript prompts can be cool, but I always try to keep my use of them to a minimum. If your "Choose From List" comes up with just a couple (say fewer than 9) options, and you want to select only "one" option, then you could always keep everything in a script filter and have the user choose from a list of results there.
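A sketch of point (2), with helper names of my own; the actual osascript call naturally requires macOS:

```python
import subprocess

def build_choose_script(options):
    """Build the AppleScript source for a "choose from list" dialog."""
    items = ', '.join('"{0}"'.format(o) for o in options)
    return 'choose from list {%s}' % items

def choose_from_list(options):
    # Shows the dialog and returns the user's selection (macOS only)
    script = build_choose_script(options)
    out = subprocess.check_output(['osascript', '-e', script])
    return out.strip().decode('utf-8')

print(build_choose_script(['medical', 'general']))
# choose from list {"medical", "general"}
```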


I'm not 100% certain what exact usage you have in mind, but I use External Triggers extensively in my Pandoctor workflow to chain actions together. Look that up and see if that might work.

What I was going for was to start the workflow using a keyword with flags as options. So, something like "export -t -i -400 : tag1 tag2" would export highlights, titlecase the text, export images at a width of 400px, and apply tags. Another option is infer spaces (-s), which is dependent on a word list. I was trying to speed up the process of inferring spaces by having word lists suited to different areas, so the list would be shorter. (Enter snafu.) If the user chooses to infer spaces on their text, I wanted the AppleScript to call up Alfred, have Alfred present my XML, let the user make a selection, and then pass the query back to the AppleScript to finish up.

 

On a different note, am I correct in assuming the XML "autocomplete" has nothing to do with filtering the XML based on user input, but is only to... well... autocomplete by pressing the right arrow? I tried to dissect the Evernote workflow to see how he accomplishes this. It's a bit complicated for my current level, but it looks like he's matching the typed letters to notebook names and then making the XML on the fly. How does one filter results?

 

 

 

I'm not sure what you're getting at here either. Why do you need to modify proxy.scpt?

The proxy.scpt serves only one purpose in my workflow, which is to store the value of a variable for later retrieval (it's just one line: property word_list : ""). I cringed a little when I went in this direction, because I know there has to be a better, more elegant way to achieve such a simple task. I read about cached data and persistent data, but I'm not sure how to access and write to these directories, and I still don't know if that's overkill for such a simple thing.

 

In regards to "choose from list", I was trying to steer clear of having AppleScript accept input, and to do that solely from Alfred (better aesthetics). The last part (your #3) was my goal, but I couldn't for the life of me figure out a way to get the query back to the running script.

 

 

 

 

Trying to use Alfred as a dialog box is a pretty hairy thing to try. Can you not split the script into two parts: one that calls Alfred with the list, and a second one that Alfred calls with the user's choice?

 

What is your program/workflow intended to do?

 

Splitting the script sounds like it might work. I will need to read more about cached data so that the original options are passed along to the second script. My confusion lay wholly in the fact that I couldn't see a way to get more input from the user (with Alfred) to modify the running process.


On a different note, am I correct in assuming the XML "autocomplete" has nothing to do with filtering the XML based on user input, but is only to... well... autocomplete by pressing the right arrow? I tried to dissect the Evernote workflow to see how he accomplishes this. It's a bit complicated for my current level, but it looks like he's matching the typed letters to notebook names and then making the XML on the fly. How does one filter results?

Exactly. autocomplete is what Alfred will show as the query if a user hits TAB on a result item (or ENTER, too, if the item is not valid).

 

arg is the value of the item that Alfred will pass to any subsequent actions.

 

How does one filter results?

I suspect your mental model of how workflows work might be a bit wrong. You filter the results before generating the XML.

It might be worthwhile to read a tutorial or two or take apart some simple workflows.

Here's a tutorial I wrote. It's tied closely to the workflow library it belongs to, and uses Python, not AppleScript, but I think it should be understandable enough for you to get an idea of how Script Filters work.
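In outline, the filter-before-you-generate-XML idea looks like this (the names here are purely illustrative):

```python
def filter_results(candidates, query):
    """Keep only the candidates containing the query (case-insensitive)."""
    query = query.lower()
    return [c for c in candidates if query in c.lower()]

# Filter first, then emit only the matches as Script Filter results
notebooks = ['Anatomy', 'Biochemistry', 'Pharmacology']
print(filter_results(notebooks, 'ph'))  # ['Pharmacology']
```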

 

My confusion lay wholly in the fact that I couldn't see a way to get more input from the user (with Alfred) to modify the running process.

Alfred isn't designed to be used by other apps, it's designed to use other apps. You'll find workflows a lot simpler to build if you use Alfred to control X instead of X to control Alfred.

Edited by deanishe

Thanks for the switch in perspective.

 

I read part one of your tutorial twice, the second time very slowly. As an exercise I'm trying to recreate the idea with something I'd use often. Merriam-Webster has a Medical Dictionary, and also an API. The results are returned as XML, and I found a thread where you do something similar. In that example you're grabbing HTML tags; unless I'm missing some fundamental idea (very likely the case), I didn't see why I couldn't do the same with the returned XML. Though, now that I'm writing this, I don't see why I'm parsing XML only to turn it back into XML, other than to display it in Alfred. I still don't think I'm looking at this correctly. It would be great if I could get this working similar to your Searchio, filtering per keypress.

 

 

 

The Dictionary API Returns:

<entry_list version="1.0">
<entry id="insulin">
    <hw>in·su·lin</hw>
    <pr>ˈin(t)-s(ə-)lən</pr>
    <sound>
        <wav>insuli01.wav</wav>
        <wpr>!in(t)-s(u-)lun</wpr>
    </sound>
    <fl>noun</fl>
    <def>
        <sensb>
        <sens>
        <dt>
            a protein hormone that is synthesized in the pancreas from proinsulin and secreted by the beta cells of the islets of Langerhans, that is essential for the metabolism of carbohydrates, lipids, and proteins, that regulates blood sugar levels by facilitating the uptake of glucose into tissues, by promoting its conversion into glycogen, fatty acids, and triglycerides, and by reducing the release of glucose from the liver, and that when produced in insufficient quantities results in diabetes mellitus
            <dx>
                see
                <dxt>ILETIN</dxt>
            </dx>
        </dt>
        </sens>
        </sensb>
    </def>
</entry>
</entry_list>

Based on the linked thread, I was trying:

# encoding: utf-8

from workflow import Workflow, ICON_WEB, web
from lib import BeautifulSoup
import sys


API_KEY = 'API KEY'


def request_mdict_search(query):
    url = u'http://www.dictionaryapi.com/api/v1/references/medical/xml'
    r = web.get(url, query, {'?key=': API_KEY})

    r.raise_for_status()
    return parse_mdict_results(r.content)


def parse_mdict_results(content):
    soup = BeautifulSoup(content)
    words = soup.findAll('entry')
    results = []
    for word in words:
        part1 = word.find('entry')
        title = wf.decode(part1.replace('<entry id="', '').replace('">', ''))
        wf.logger.debug(title)
        url = u'http://www.merriam-webster.com/medical/' + title
        part2 = table.find('dt')
        desc = wf.decode(part2.replace('<dt>', '').replace('</dt>', '')) # going to be a problem
        results.append((title, url, desc))
    return results


def main(wf):
    query = wf.args[0]

    def wrapper():
        return request_mdict_search(query)

    #results = wf.cached_data('results', wrapper, max_age=60)
    results = request_mdict_search(query)

    for result in results:
        wf.add_item(
            title=result[0],
            subtitle=result[1],
            arg=result[1],
            valid=True,
            icon=ICON_WEB)

    wf.send_feedback()

if __name__ == '__main__':
    wf = Workflow()
    sys.exit(wf.run(main))

Ideally I'd have the Title and Subtitle in Alfred as the word and definition. Though I'm concerned about the <dx> & <dxt> tags inside the definition tag; I would need to ignore these in the result.

 

Alternatively the website gives a php example.

<?php

// This function grabs the definition of a word in XML format.
function grab_xml_definition ($word, $ref, $key)
    {    $uri = "http://www.dictionaryapi.com/api/v1/references/" . urlencode($ref) . "/xml/" .
                    urlencode($word) . "?key=" . urlencode($key);
        return file_get_contents($uri);
    };

$xdef = grab_xml_definition("test", "medical", "API_KEY");

?>

Which, if either, of the two directions should I work towards?


Python with BeautifulSoup is definitely the way to go. It's sleek, intuitive, and fast. You're close with your Python example, so I'd stay the course.

Reading through the BeautifulSoup documentation will help you get a sense of what's possible. Also, check out others' workflows; my LibGen workflow is right along this path. When I get some free time, I'll give you more specific feedback, but it really does look pretty good already.


Here's what I've changed. I'm not getting an error, but I'm not getting results either.

# encoding: utf-8

from workflow import Workflow, ICON_WEB, web
from bs4 import BeautifulSoup
#from lxml import etree
import sys




API_KEY = 'API'




def _mdict_search(query):
    url = u'http://www.dictionaryapi.com/api/v1/references/medical/xml/' + query + "?key="
    params = dict(auth_token=API_KEY)
    
    r = web.get(url, params)
    r.raise_for_status()
    return _mdict_results(r.content)




def _mdict_results(content):
    soup = BeautifulSoup(content)
    words = soup.find_all(id=True)
    results = []
    for word in words:
        term = word.find_all('entry')
        title = wf.decode(term.replace('<entry id="', '').replace('">', ''))
        wf.logger.debug(title)
        url = u'http://www.merriam-webster.com/medical/' + title
        define = table.find('dt')
        desc = wf.decode(define.replace('<dt>', '').replace('</dt>', '')) # going to be a problem
        results.append((title, url, desc))
    return results




def main(wf):
    query = wf.args[0]


    def wrapper():
        return _mdict_search(query)


    #results = wf.cached_data('results', wrapper, max_age=60)
    results = _mdict_search(query)


    for result in results:
        wf.add_item(
            title=result[0],
            subtitle=result[2],
            arg=result[1],
            valid=True,
            icon=ICON_WEB)


    wf.send_feedback()


if __name__ == '__main__':
    wf = Workflow()
    sys.exit(wf.run(main))

Where am I going wrong?


This should do the trick:

 

# encoding: utf-8

from __future__ import print_function, unicode_literals

from HTMLParser import HTMLParser
from urllib import quote
import sys
import hashlib
from xml.etree.cElementTree import fromstring as make_etree

from workflow import Workflow, ICON_WEB, web


API_KEY = 'API KEY'
# These don't really need to be bytes, but it's more correct, as
# we're going to replace {query} and {title} with bytestrings
API_URL = b'http://www.dictionaryapi.com/api/v1/references/medical/xml/{query}'
WEB_URL = b'http://www.merriam-webster.com/medical/{title}'

log = None

def dict_search(query):
    """Return XML from API"""
    url = API_URL.format(query=quote(query.encode('utf-8')))
    params = {'key': API_KEY}
    r = web.get(url, params)
    r.raise_for_status()
    # Return `r.text`, not `r.content` because
    # we want Unicode, not bytes. In this case, returning `r.content`
    # should work just as well, but `web.py` has additional encoding
    # information from the HTTP headers, so it's usually a good idea
    # to let it do any decoding
    return r.text


def sanitise_output(text, simplify=False):
    """Decode HTML entities. Also remove/replace some non-ASCII characters
    if `simplify` is True"""
    # HTML entities -> text
    h = HTMLParser()
    text = h.unescape(text)
    if simplify:
        # Remove ·
        text = text.replace('\xb7', '')
        # Replace en-dashes with hyphens
        text = text.replace('\u2013', '-')

    return text


def parse_response(xmlstr):
    """Parse XML response to list of dicts"""
    results = []
    root = make_etree(xmlstr)
    entries = root.findall('entry')
    log.debug('{} entries'.format(len(entries)))

    for entry in entries:
        # Default values
        data = {'title': None, 'url': None, 'definition': None}
        # Title
        hw = entry.find('hw')
        if hw is not None and hw.text is not None:
            title = sanitise_output(wf.decode(hw.text), True)
            data['title'] = title
            data['url'] = WEB_URL.format(title=quote(title.encode('utf-8')))

        # Definition
        definition = entry.find('def/sensb/sens/dt')
        if definition is not None and definition.text is not None:
            data['definition'] = sanitise_output(wf.decode(definition.text))

        log.debug(data)

        if data['title'] is None:  # Ignore results w/o title
            continue

        results.append(data)

    return results


def make_cache_key(query):
    """Return cache key for `query`"""
    m = hashlib.md5(query)
    return m.hexdigest()


def main(wf):
    query = wf.args[0]

    def wrapper():
        return dict_search(query)

    # We want to keep a separate cache for each query, so we generate
    # a cache key based on `query`. We use an MD5 hash of `query` because
    # query might contain characters that are not allowed in filenames
    key = 'search-{}'.format(make_cache_key(query))
    
    # During development, cache the XML rather than the parsed results.
    # This way, we can change the parsing code and get different results
    # in Alfred without hitting the API all the time
    xmlstr = wf.cached_data(key, wrapper, max_age=600)

    results = parse_response(xmlstr)

    # Compile results for Alfred
    for result in results:
        wf.add_item(
            title=result['title'],
            subtitle=result['definition'],
            arg=result['url'],
            valid=True,
            icon=ICON_WEB)

    wf.send_feedback()


if __name__ == '__main__':
    wf = Workflow()
    log = wf.logger
    sys.exit(wf.run(main))

 

I'm not sure exactly where your code is going wrong (I don't have BeautifulSoup installed), but there are a few issues with it.

 

Firstly, you have the URL wrong. API_KEY needs to go in the key param, not auth_token, and ?key= wants removing from the URL. When you add strings to a URL, you also need to %-escape them (urllib.quote) to ensure the URL is valid. That function requires a bytestring, however, which is why I do quote(title.encode('utf-8')) instead of just quote(title); the latter would explode if given non-ASCII.

 

Secondly, there's no real need to use BeautifulSoup to parse XML, at least not for something this simple. The built-in library is more than up to the job, and there's no need for things like str.replace with this XML: the XML library will parse and return the text as you want it.

 

Thirdly, when you cache results, use a cache key based on the query. If you just use 'results', all your result sets will use the same cache file regardless of the query, so until the cache expires you'll always see the results from the first query that got cached instead of the results for the current query.

 

I've added a bit of tidying-up code in sanitise_output(). It turns HTML entities like &amp; back into proper text, which is important, but it also removes/replaces a couple of characters I noticed popping up in result titles that would make the web URL lead to no results. That code will probably need refining over time as you find other characters that break the web URL.

 

If you have any questions, ask away.

Edited by deanishe

I got as far as returning the below from Python, but was running into many issues in different circumstances.

 

[(u'insulin', u'http://www.merriam-webster.com/medical/insulin', u'a protein hormone that is synthesized in the pancreas from proinsulin and secreted by the beta cells of the islets of Langerhans, that is essential for the metabolism of carbohydrates, lipids, and proteins, that regulates blood sugar levels by facilitating the uptake of glucose into tissues, by promoting its conversion into glycogen, fatty acids, and triglycerides, and by reducing the release of glucose from the liver, and that when produced in insufficient quantities results in diabetes mellitus '), (u'insulin coma therapy', u'http://www.merriam-webster.com/medical/insulin%20coma%20therapy', <sx>INSULIN SHOCK THERAPY</sx>), (u'insulin{ndash}dependent diabetes', u'http://www.merriam-webster.com/medical/insulin%7Bndash%7Ddependent%20diabetes', <sx>type 1 diabetes</sx>)]

 

 

One such issue was encoding. In your tutorial, you mention converting everything as it's brought in, and then converting back on the way out. There seem to be many ways to approach this (u'', unicode(), .decode(), etc.), though I may be mixing apples and oranges. How can I reveal the encoding of any given string, so that I can begin to understand this idea?

 

 

Before your answer, I had changed the URL to u'x' + urllib.quote(y) + u'z' + API out of desperation. So the ? and = are inferred from their type? That would make good sense. In Python, would you call {'key': API_KEY} a record, and how is this distinct from a dict?

 

I had given up on str.replace, and was using the below, but the definition still contained some children. Removing BeautifulSoup altogether seems optimal, to say the least.

    term = word.get('id')
    url = u'http://www.merriam-webster.com/medical/' + urllib.quote(term)
    def_tag = word.find('dt')
    term_def = def_tag.contents[0]
Your last two points I would like to ask questions on after I've digested them a little.
 
Also, thank you very much for taking the time.

One such issue was encoding. In your tutorial, you mention converting everything as it's brought in, and then converting back on the way out. There seem to be many ways to approach this (u'', unicode(), .decode(), etc.), though I may be mixing apples and oranges. How can I reveal the encoding of any given string, so that I can begin to understand this idea?

To see what type an object has, do print(repr(obj)) or log.debug(repr(obj)). Unicode strings will look like u'blah' and bytestrings will look like 'blah'.

 

u'' doesn't convert anything: it's just a way of specifying in your source code that the string is a Unicode string. In my script, I imported unicode_literals, which switches the behaviour, so that 'blah' is a Unicode string and b'blah' is a bytestring.

 

'blah'.decode('utf-8') and unicode('blah', 'utf-8') are equivalent. Workflow.decode('blah') is a better bet in workflows, though, as you can also call it on Unicode strings without an error and it also normalises Unicode.

 

Before your answer, I had changed the URL to u'x' + urllib.quote(y) + u'z' + API out of desperation. So the ? and = are inferred from their type? That would make good sense. In Python, would you call {'key': API_KEY} a record, and how is this distinct from a dict?

With web.py, you can either pass a URL already containing a query string (?a=b&c=d), or pass params, which web.py will turn into a query string (so {'key': API_KEY} becomes ?key=API_KEY). Unfortunately, it's not smart enough to handle both at once.

{'key': API_KEY} and dict(key=API_KEY) are functionally equivalent. They each have their advantages, but I tend to use the former as, all other things being equal, it runs faster.
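For instance (API_KEY here is a placeholder value):

```python
API_KEY = 'placeholder-key'

# The two spellings build the same dict; the literal avoids a function call
literal = {'key': API_KEY}
constructed = dict(key=API_KEY)
print(literal == constructed)  # True
```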

 

I had given up on str.replace, and was using the below, but the definition still contained some children. Removing BeautifulSoup altogether seems optimal, to say the least.

Yeah. I think ElementTree's parsing behaviour is better in this regard.

 

Your last two points I would like to ask questions on after I've digested them a little.

 

Also, thank you very much for taking the time.

No problem.

Edited by deanishe
