
ZotQuery: an Alfred workflow for Zotero



This is just an idea to speed up ZotQuery, and I have no idea whether it actually works. Right now, I think that ZotQuery starts an entirely new search whenever I change the search query. As a result, first typing "Kant" and then adding a word of the title such as "metaphysics" (search query "Kant metaphysics") both take the same amount of time. The same is true for slow typing (start with "Kan" and take a moment to add the "t"). Would it be possible to save the intermediate search results and filter from that set of results to speed up the search? Of course, this only works when users add something, not when they delete parts of the search query. Just a thought; maybe I am completely wrong about this... Thanks for the great workflow!

Link to comment

It does start a new search whenever you change the query. That's how Alfred works. It calls the workflow anew every time you press a key instead of interacting with a longer-running process (that could hold onto its filtered set of results).

Ideally, a workflow would create a search index for that particular set of data, but I haven't looked into that yet (I wrote one of the underlying libraries that ZotQuery uses and which provides the filtering functionality).

The filter function uses a relatively complex search algorithm (to rate "of", for example, as a better match for "OmniFocus" than for "Duke of York", which is in turn better than "Smirnoff"), and as a result isn't capable of rapidly searching thousands of items (complex code, no search index, relatively slow programming language) unless you turn off some of the filter options.

 

You could try "working with" the search algorithm (e.g. "km" will match "Kant Metaphysics"), or, if you're prepared to get your hands dirty, you could look at the search options and tinker with ZotQuery's source here and here to whittle down the applied rules and achieve an acceptable compromise between flexibility and performance.

I can look into the feasibility of caching "partial" searches and/or creating a search index, but fundamentally, the library on which ZotQuery is based wasn't designed for searching thousands of items without disabling a lot of the "intelligence" of the search algorithm.

Edited by deanishe
Link to comment

Would caching search results with a lifespan of ~2 seconds achieve this result? When Alfred calls the search filter, it first checks for the cache. If it's there, filter from that data set. If it's not, filter from the full data set.

Obviously this would mean you couldn't run different queries within 2 seconds of one another. But that may be a worthwhile trade-off...

Link to comment

I also thought about caching the results. smarg19, I think the cache should combine a lifespan and the search query. So when I search for "Kan", the cache is only used when the next query (either a new or an extended search) starts with "Kan". That way the user can actually query for different things within the lifespan. I also think the complex search algorithm shouldn't be a problem. Searching for "of" matches "OmniFocusFocus", "Duke off York" and "of" (stupid modification of the example above). Not sure about the ranking, but all hits get cached. Now I extend my search to "off" by adding another "f". ZotQuery uses the cache if (1) the new search happens within the lifespan and (2) includes the previous search "of" as the first characters of the query. In this case, ZotQuery would just filter and rank the three previous results and end up with two returns: "OmniFocusFocus" and "Duke off York". Does that make sense?

Edited by LeChuck
Link to comment

You'd have to check on space issues (how much more space would the caches take?), but you might give the option to clear the search cache only when you update the database cache, because all the search results should hold until that happens. To make it more space-efficient, you could define a maximum amount of space the search cache may use (either hardcoded or set by the user) and then invalidate the oldest caches when the search cache hits that upper limit.

Link to comment

Ah, so you're suggesting that there be multiple filter caches? Sort of like your frequent searches, so that if you search for that thing again, ZotQuery will be faster. I was initially planning on having just one cache to speed up only the present search, but this setup could also work.

 

In general, I'm trying to intelligently improve performance via caching in a few places. I'm almost done with a system for caching export results, so that exporting something the second time will not require an internet connection and will be instantaneous. I think that caching search results is an equally good idea. 

 

I'll test and investigate, but right now, here's my thought (sketched in code after the list):

 

When a query is initiated, ZotQuery

  • (1) checks whether there is a cache for that query (I will save query filter results under the query name). If there is a cache, return those results (this is the speed-up for the frequent-searches situation).
  • If there isn't, (2) check for a recent cache (this is the speed-up for the slow-typing situation).
  • If there is one, (3) check that the new query starts with that recent query (e.g. the recent query thom has cached results, and the new query is thomas).
  • If it does, (4) filter the new query against the cached data only (not the full data set). If the answer to any of 1-4 is no, then filter the new query against the full data set.
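In rough Python, that lookup would look something like this (just a sketch; do_filter stands in for whatever function actually filters a data set against a query, and the cache paths are illustrative):

import json
import os
import time

CACHE_DIR = 'cache'    # illustrative location
LIFESPAN = 2           # seconds a "recent" cache stays valid

def filter_with_cache(query, full_data, do_filter):
    # (1) exact cache for this query -> frequent-searches speed-up
    query_cache = os.path.join(CACHE_DIR, query + '.json')
    if os.path.exists(query_cache):
        with open(query_cache) as fp:
            return json.load(fp)
    # (2) recent cache -> slow-typing speed-up
    data = full_data
    recent = os.path.join(CACHE_DIR, 'recent.json')
    if (os.path.exists(recent)
            and time.time() - os.path.getmtime(recent) < LIFESPAN):
        with open(recent) as fp:
            cached = json.load(fp)
        # (3) only usable if the new query extends the cached one
        if query.startswith(cached['query']):
            # (4) filter the (much smaller) cached result set
            data = cached['results']
    results = do_filter(query, data)
    # store the results for steps (1) and (2) next time round
    with open(recent, 'w') as fp:
        json.dump({'query': query, 'results': results}, fp)
    with open(query_cache, 'w') as fp:
        json.dump(results, fp)
    return results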

 

In your cache folder, things would look like this: there would be a number of files for past searches; these files are named [query].json. There is one file called recent.json where the last search is stored. (There are also other files pertaining to the actioning processes, but those aren't important here).

 

I can have functions to

  • (1) delete all files when the database is updated, or
  • (2) delete all files when the cache dir exceeds some space limit (this limit will be hardcoded, but easily altered by the user).

 

Right now, my thought is to check the cache dir size alongside the check for whether or not to update the database. So that process will look like this (see the sketch after the list):

  • (1) Is the database up-to-date? If yes, do nothing; if no, update it and delete the cache.
  • (2) Is the cache dir an appropriate size? If yes, do nothing; if no, delete the cache.
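As a rough sketch (database_is_current and update_database are stand-ins for ZotQuery's actual update functions, and the size limit is made up):

import os

MAX_CACHE_BYTES = 10 * 1024 * 1024  # hardcoded, but easily altered

def cache_size(cache_dir):
    # total size of all cached search files, in bytes
    return sum(os.path.getsize(os.path.join(cache_dir, name))
               for name in os.listdir(cache_dir))

def clear_cache(cache_dir):
    for name in os.listdir(cache_dir):
        os.remove(os.path.join(cache_dir, name))

def maintain(cache_dir, database_is_current, update_database):
    # (1) stale database? update it and drop all cached searches
    if not database_is_current():
        update_database()
        clear_cache(cache_dir)
    # (2) cache dir too big? drop all cached searches
    elif cache_size(cache_dir) > MAX_CACHE_BYTES:
        clear_cache(cache_dir)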

 

This will all be in one function called after each ZotQuery action (i.e. when you export or open something). 

 

Any thoughts or suggestions for this course of action?

Edited by smarg19
Link to comment

You shouldn't directly use the query as the filename, as this would break if someone enters "/" in a query, for example. You could use a slugify function or give the cache files random names and keep an index.
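For example, hashing the query gives you a filesystem-safe name for free (a sketch, not ZotQuery's actual code):

import hashlib

def cache_filename(query):
    # only the hex digest touches the filesystem, so characters
    # like "/" in the query can't break anything
    digest = hashlib.md5(query.encode('utf-8')).hexdigest()
    return digest + '.json'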
 
You should load a cache if query starts with the query of the cached data. So, say you have a cache for the query "cun", that would be loaded for the new query "cunning".
 
Whether you then save another cache for "cunning" is a tricky question (presumably, you'd also already have caches for "cunn", "cunni", "cunnin").  You might want to consider the number of results when deciding whether to cache them. If the parent set only contains, say, 300 entries, it's probably not worth creating another cache.
 
You'd probably want to purge all caches when the underlying data is updated.
 
I wouldn't worry too much about the cache size: instead of duplicating all the data, you can just cache the search keys and some form of ID with which you can look up the actual data in your main dataset. (Unless that's so big it significantly impacts performance.)
 
You could occasionally sweep the cache for old cache files (by last access time, not modification time).



Personally, I think creating a proper search index is a much better solution. You can use the background.py module to update the index whenever the data has changed (indexing would be a lot slower than filtering). If you put the index data in an sqlite database, actual search would be super fast.

Edited by deanishe
Link to comment

What exactly would you recommend? An inverted index? Or both a forward index and an inverted index? Something else? I've started reading up on search indexes, and there are quite a few possibilities. Also, how could I make this work while keeping the various search types that ZotQuery uses?

Link to comment

An inverted index or a forward index. Exactly how you structure it depends on which search semantics you want to use. You might want to parse the search keys (in filter() terms) for capital letters or the first letters of words to maintain "of"->"OmniFocus" style searching, or just store the entire key in the DB and rely on sqlite's (awesome) search capabilities.
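For example, to keep "of" matching "OmniFocus", you could store the initials of each search key alongside the key itself (a sketch of the idea, not Alfred-Workflow's actual implementation):

import re

def initials(key):
    # capital letters plus the first letters of words:
    # 'OmniFocus' -> 'of', 'Duke of York' -> 'doy'
    return ''.join(re.findall(r'[A-Z]|\b[a-z]', key)).lower()

print(initials('OmniFocus'))     # of
print(initials('Duke of York'))  # doy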

 

I don't know how ZotQuery users typically search.

 

If you're doing any additional processing, you'd probably want to store the search key in the database as a way of determining if an item has been updated and needs re-indexing.

Edited by deanishe
Link to comment

Unfortunately, this looks like too much of a learning curve at this stage. I think I'll go with the caching right now, since I know how to implement that. 

 

On search indexing, have you ever heard of/used Whoosh? I've been reading up on it. It's a pure Python search library. 

Link to comment

I have used it once or twice. It's pretty damn good. You might want to try it out for ZotQuery. It's a much better fit than Alfred-Workflow. It's probably the simplest solution if you don't want to start messing around with SQL.

 

TBH, creating an index is probably simpler than caching queries. You don't have all these thorny questions about which queries to cache, how to keep cache size under control, when to delete cached queries etc. It's also trivial to update the index in the background (using background.py). You can also do "abc AND xyz" with sqlite's full-text search.
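Here's the core of the sqlite approach, boiled right down (a toy sketch, not the demo workflow; FTS3, since that's what the system Python supports):

import sqlite3

con = sqlite3.connect(':memory:')

# an FTS3 virtual table: every column is full-text searchable
con.execute('CREATE VIRTUAL TABLE books USING fts3(id, author, title)')
con.executemany('INSERT INTO books VALUES (?, ?, ?)', [
    ('1', 'Immanuel Kant', 'Critique of Pure Reason'),
    ('2', 'Aristotle', 'The Poetics'),
    ('3', 'David Hume', 'A Treatise of Human Nature'),
])

# space-separated terms are implicitly ANDed; OR and column
# queries like author:kant also work
for query in ('critique kant', 'author:aristotle', 'kant OR hume'):
    rows = con.execute('SELECT author, title FROM books '
                       'WHERE books MATCH ?', (query,)).fetchall()
    print(query, '->', rows)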

Link to comment

One thing I'm still a bit confused about is the Whoosh schema. It appears that there is no way to nest fields (as in a JSON array of dicts as the value of a key). How do I store a list of item creators in Whoosh?

Link to comment

No idea, tbh. Haven't used it in a long time. I'd imagine it isn't possible to nest data because that's just not how most search engines work.

 

To prove my point re search indices, I've spent my evening writing a demo workflow showing how to use sqlite3 for a search index. Here's the workflow. Here's the GitHub repo. Hopefully, it's sufficiently well commented to be useful as a starting point. (If not, ask away.)

 

The dataset is the (almost) complete list of Project Gutenberg ebooks. There are 45,000 entries (author, title, URL).

 

Now tell me this doesn't totally rock  ;) 

 

Note: If you want to use "AND" or "OR", they have to be in capitals. (So sue me…)

 

demo.gif

 

Man, I should've used this from the start in Alfred-Workflow :(

 

By all means use whoosh, but I reckon sqlite3 is eminently suitable for ZotQuery's needs (especially as whoosh is designed for searching separate documents, not a single monolithic data file) and is included with the system Python. A bit of SQL is something every developer should know.

Edited by deanishe
Link to comment

Here's some sample log output from the above workflow to give you a concrete idea of exactly how fast sqlite full-text search is:

11:10:53 background.py:220 DEBUG    Executing task `indexer` in background...
11:10:53 index.py:43 INFO     Creating index database
11:10:53 index.py:56 INFO     Updating index database
11:10:53 books.py:110 INFO     0 results for `im` in 0.001 seconds
11:10:53 books.py:110 INFO     0 results for `imm` in 0.001 seconds
11:10:53 books.py:110 INFO     0 results for `imma` in 0.001 seconds
11:10:55 index.py:73 INFO     44549 items added/updated in 2.19 seconds
11:10:55 books.py:110 INFO     0 results for `imman` in 1.710 seconds
11:10:55 index.py:80 INFO     Index database update finished
11:10:55 background.py:270 DEBUG    Task `indexer` finished
11:10:55 books.py:110 INFO     15 results for `immanuel` in 0.002 seconds
11:10:58 books.py:110 INFO     100 results for `p` in 0.017 seconds
11:10:59 books.py:110 INFO     4 results for `ph` in 0.002 seconds
11:10:59 books.py:110 INFO     0 results for `phi` in 0.002 seconds
11:11:00 books.py:110 INFO     9 results for `phil` in 0.002 seconds
11:11:00 books.py:110 INFO     3 results for `philo` in 0.002 seconds
11:11:00 books.py:110 INFO     0 results for `philos` in 0.001 seconds
11:11:00 books.py:110 INFO     0 results for `philosp` in 0.001 seconds
11:11:01 books.py:110 INFO     0 results for `philospo` in 0.001 seconds
11:11:01 books.py:110 INFO     0 results for `philosp` in 0.001 seconds
11:11:02 books.py:110 INFO     0 results for `philos` in 0.002 seconds
11:11:02 books.py:110 INFO     0 results for `philoso` in 0.001 seconds
11:11:02 books.py:110 INFO     0 results for `philosoh` in 0.003 seconds
11:11:02 books.py:110 INFO     0 results for `philosohp` in 0.002 seconds
11:11:02 books.py:110 INFO     0 results for `philosohpy` in 0.002 seconds
11:11:03 books.py:110 INFO     0 results for `philosohp` in 0.002 seconds
11:11:03 books.py:110 INFO     0 results for `philosoh` in 0.001 seconds
11:11:03 books.py:110 INFO     0 results for `philoso` in 0.001 seconds
11:11:03 books.py:110 INFO     0 results for `philosop` in 0.001 seconds
11:11:03 books.py:110 INFO     0 results for `philosopj` in 0.001 seconds
11:11:03 books.py:110 INFO     0 results for `philosopjy` in 0.002 seconds
11:11:04 books.py:110 INFO     0 results for `philosopj` in 0.002 seconds
11:11:04 books.py:110 INFO     0 results for `philosop` in 0.002 seconds
11:11:04 books.py:110 INFO     0 results for `philosoph` in 0.002 seconds
11:11:04 books.py:110 INFO     100 results for `philosophy` in 0.012 seconds
11:11:08 books.py:110 INFO     100 results for `philosophy ` in 0.007 seconds
11:11:09 books.py:110 INFO     2 results for `philosophy t` in 0.002 seconds
11:11:09 books.py:110 INFO     0 results for `philosophy ti` in 0.002 seconds
11:11:10 books.py:110 INFO     0 results for `philosophy tit` in 0.002 seconds
11:11:11 books.py:110 INFO     0 results for `philosophy titl` in 0.002 seconds
11:11:11 books.py:110 INFO     0 results for `philosophy title` in 0.002 seconds
11:11:11 books.py:110 INFO     100 results for `philosophy title:` in 0.007 seconds
11:11:11 books.py:110 INFO     0 results for `philosophy title:t` in 0.002 seconds
11:11:11 books.py:110 INFO     0 results for `philosophy title:th` in 0.002 seconds
11:11:11 books.py:110 INFO     72 results for `philosophy title:the` in 0.010 seconds
11:11:12 books.py:110 INFO     40 results for `philosophy a` in 0.006 seconds
11:11:13 books.py:110 INFO     0 results for `philosophy au` in 0.002 seconds
11:11:13 books.py:110 INFO     0 results for `philosophy aut` in 0.002 seconds
11:11:13 books.py:110 INFO     0 results for `philosophy auth` in 0.002 seconds
11:11:13 books.py:110 INFO     0 results for `philosophy autho` in 0.002 seconds
11:11:13 books.py:110 INFO     0 results for `philosophy author` in 0.002 seconds
11:11:14 books.py:110 INFO     100 results for `philosophy author:` in 0.009 seconds
11:11:14 books.py:110 INFO     0 results for `philosophy author:k` in 0.002 seconds
11:11:14 books.py:110 INFO     0 results for `philosophy author:ka` in 0.002 seconds
11:11:14 books.py:110 INFO     0 results for `philosophy author:kan` in 0.002 seconds
11:11:15 books.py:110 INFO     0 results for `philosophy author:kant` in 0.002 seconds
11:11:18 books.py:110 INFO     3 results for `philosophy author:a` in 0.003 seconds
11:11:18 books.py:110 INFO     0 results for `philosophy author:ar` in 0.002 seconds
11:11:19 books.py:110 INFO     0 results for `philosophy author:ari` in 0.002 seconds
11:11:19 books.py:110 INFO     0 results for `philosophy author:aris` in 0.002 seconds
11:11:20 books.py:110 INFO     0 results for `philosophy author:arist` in 0.002 seconds
11:11:20 books.py:110 INFO     0 results for `philosophy author:aristo` in 0.002 seconds
11:11:20 books.py:110 INFO     0 results for `philosophy author:aristot` in 0.002 seconds
11:11:20 books.py:110 INFO     0 results for `philosophy author:aristotl` in 0.002 seconds
11:11:20 books.py:110 INFO     0 results for `philosophy author:aristotle` in 0.002 seconds
11:11:22 books.py:110 INFO     15 results for `author:aristotle` in 0.002 seconds
 
Edited by deanishe
Link to comment

You're right. This is bonkers. I'm reading up on it all now, but I think this is a solid foundation. I'll need to do lots of fiddling to optimize this setup for ZotQuery. The original Zotero data is already in a SQLite database; I wrote a script to translate that to JSON, so I'll need to figure out the most efficient way to get the key SQLite data into the FTS virtual table. I could also see how a Pythonic wrapper for this core functionality could be a nice addition to Alfred-Workflow. I'm definitely going to pursue this course of action, so thanks for pointing me in the right direction.

Link to comment

If the data is already in an SQLite database, you might consider just leaving it there. As you can see, the performance is ridiculous compared to messing around with JSON and Python. (SQLite is pure C and super-optimised for exactly this kind of stuff; it's what CoreData is based on.)

 

I must admit, my MailTo workflow does pull data from SQLite databases and cache them in JSON, but the JSON is essentially search indices, and I would have used an SQLite cache if the performance weren't acceptable.

 

If you're creating an FTS virtual table (FTS3 only—FTS4 isn't supported by the system Python), you just need to insert a unique id (as a reference to the original full dataset) and the fields you want to search on. In the demo workflow, I included the id for demonstration purposes (it isn't used), but set its "rank" to zero, so it is ignored when ranking the search results.

 

If you really don't want to mess around with writing SQL queries, you can use an ORM like SQLAlchemy or Peewee. That's how most Python developers use SQL databases, tbh. They allow you to treat database tables/rows as classes/instances. Very pleasant to use.
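To give you a taste, a Peewee version might look like this (a hypothetical model, not ZotQuery's actual schema):

from peewee import SqliteDatabase, Model, CharField

db = SqliteDatabase('zotquery.db')

class Item(Model):
    key = CharField(unique=True)  # Zotero item key
    title = CharField()
    creators = CharField()

    class Meta:
        database = db

db.connect()
db.create_tables([Item])

Item.create(key='ABC123', title='Critique of Pure Reason',
            creators='Immanuel Kant')

# rows come back as class instances; no hand-written SQL
for item in Item.select().where(Item.title.contains('critique')):
    print(item.title, '--', item.creators)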

 

I suspect this might mean a serious restructuring of ZotQuery, but IMO the performance is compelling. It all depends on what the typical dataset size is. You can't search thousands of items with Alfred-Workflow's filter() function, but a JSON-based data cache (properly keyed) should be just fine for at least 10,000 items if combined with a more efficient search mechanism.

 

Moving entirely to using the original SQLite db might be more work than it's worth, but I reckon re-implementing the search in SQLite is well worth it.

 

WRT Alfred-Workflow, I've been thinking all day about a useful abstraction that could use SQLite for search. The user would, in any case, have to specify a schema. But how do I go about indexing/updating the search database? Does the user call an index() function with all the data, or specify a callback that the indexer can call if it's out of date? Should the indexer return just the ids (and rank?) of the search results, or require a callback to retrieve the full dataset, so it can return the complete data like filter()?
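Purely to make those questions concrete, one possible shape (entirely hypothetical, not an existing Alfred-Workflow API):

import os
import sqlite3
import time

def search(query, data_func, index_db, max_age=600):
    # Hypothetical API: search an FTS index, rebuilding it when stale.
    # data_func is a callback returning (id, searchkey) pairs; it is
    # only called when the index is missing or older than max_age
    # seconds. Returns matching ids; the caller maps them back to
    # the full items.
    if (not os.path.exists(index_db)
            or time.time() - os.path.getmtime(index_db) > max_age):
        con = sqlite3.connect(index_db)
        con.execute('DROP TABLE IF EXISTS search')
        con.execute('CREATE VIRTUAL TABLE search USING fts3(id, key)')
        con.executemany('INSERT INTO search VALUES (?, ?)', data_func())
        con.commit()
        con.close()
    con = sqlite3.connect(index_db)
    return [row[0] for row in con.execute(
        'SELECT id FROM search WHERE search MATCH ?', (query,))]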

Edited by deanishe
Link to comment
  • 1 month later...

Hello Stephen, 

 

I'm not sure exactly what is going on, but the config process isn't happening for me. 

 

My versions: OSX 10.9.4, Alfred 2.3, Python 2.7.5, Zotero 4.0.21.5, ZotQuery 8.5

 

After installing ZotQuery, I run z:config through Alfred. It creates the workflow folder and the prefs.json/settings.json files, but then does nothing afterwards. The only debug output that Alfred gives me is:

 

 

Starting debug for 'ZotQuery'

 

[INFO: alfred.workflow.input.keyword] Processing output 'alfred.workflow.action.script' with arg ''

 

I see that the config file runs:

# then set the paths
python zotquery.py --config "paths"

# next get the user information
python zotquery.py --config "api"

# finally, set export preferences
python zotquery.py --config "prefs"

When trying to run these directly in the terminal, I get:

Mikels-MacBook-Air:~ mikelduffy$ python /Users/mikelduffy/Dropbox/Alfred/Alfred.alfredpreferences/workflows/user.workflow.B4369E24-57D2-410F-B82D-71359CE0CD1E/zotquery.py --config "paths"
Traceback (most recent call last):
  File "/Users/mikelduffy/Dropbox/Alfred/Alfred.alfredpreferences/workflows/user.workflow.B4369E24-57D2-410F-B82D-71359CE0CD1E/zotquery.py", line 86, in <module>
    bundler.init()
  File "/Users/mikelduffy/Dropbox/Alfred/Alfred.alfredpreferences/workflows/user.workflow.B4369E24-57D2-410F-B82D-71359CE0CD1E/bundler.py", line 509, in init
    _update()
  File "/Users/mikelduffy/Dropbox/Alfred/Alfred.alfredpreferences/workflows/user.workflow.B4369E24-57D2-410F-B82D-71359CE0CD1E/bundler.py", line 288, in _update
    'Your workflow will continue momentarily')
  File "/Users/mikelduffy/Dropbox/Alfred/Alfred.alfredpreferences/workflows/user.workflow.B4369E24-57D2-410F-B82D-71359CE0CD1E/bundler.py", line 258, in _notify
    notifier = utility('terminal-notifier')
  File "/Users/mikelduffy/Dropbox/Alfred/Alfred.alfredpreferences/workflows/user.workflow.B4369E24-57D2-410F-B82D-71359CE0CD1E/bundler.py", line 194, in __call__
    path = self.func(*args, **kwargs)

----Repeats ~200-300 times----

  File "/Users/mikelduffy/Dropbox/Alfred/Alfred.alfredpreferences/workflows/user.workflow.B4369E24-57D2-410F-B82D-71359CE0CD1E/bundler.py", line 473, in utility
    _update()
  File "/Users/mikelduffy/Dropbox/Alfred/Alfred.alfredpreferences/workflows/user.workflow.B4369E24-57D2-410F-B82D-71359CE0CD1E/bundler.py", line 282, in _update
    update_data = json.load(file, encoding='utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 351, in loads
    return cls(encoding=encoding, **kw).decode(s)
RuntimeError: maximum recursion depth exceeded while calling a Python object

I'm not too sure where to go from here. Any tips? =)

 

Many thanks in advance!

Link to comment


Unfortunately, the problem lies in a 3rd-party utility, which I think might be broken. Try the following, and if it doesn't work, I'll post an update that avoids that 3rd-party utility.

So, go to /Users/mikelduffy/Library/Application Support/Alfred 2/Workflow Data/alfred.bundler-aries. Delete this alfred.bundler-aries directory. Then try running z:config again and tell me what happens. Hopefully alfred.bundler will reinstall itself and fix the problem. If it doesn't, I'll post a fix.

Link to comment

Thanks for the super fast response. Deleting alfred.bundler-aries worked! Everything is up and running now. Not sure what happened with the bundler, but all is good!

 

:D

 

The aries version of the bundler has a few random problems that crop up when internet connectivity isn't perfect. The error checking and recovery isn't implemented well enough there. So the 'hit it to fix it' approach of basically rebooting the bundler deletes improperly downloaded files so that they can be redownloaded.

 

We're actively developing a new major version of the bundler (300+ commits this month) that will take care of all the glitches and performance issues of the first major version. It's a rewrite from the ground up, so yes, this problem should disappear soon.

Link to comment


IIRC, that was mostly just a dumb mistake by me. That one, at least, is fixed in the newer code.

Link to comment

To any and all ZotQuery users,

I am working currently on a ground-up rewrite of the workflow. It will be faster and (hopefully) simpler to maintain. However, this rewrite is probably the best time to suggest features or functionality you think would make the workflow better. I can't promise that I will actually implement anything, but I would love to hear your thoughts on how ZotQuery could work better for you. :)

stephen

Link to comment
