Workflow Request: Searching sections/bookmarks in pdfs

mbigras · November 25, 2016

Hi All!

I'd love a workflow that lets me search easily through all the pdfs on my system.

How awesome would it be to quickly bring up a pdf by section, or a past bookmark/annotation?

I think it would make a library of pdfs much more accessible.

Requesting this workflow/pointing me towards one that already exists/some ideas for how to start working on it.

Thanks

Max

dunkaroo · November 29, 2016

I had the same problem after ditching "Devonthink" and replacing it with "Finder", "Dropbox" and "Ulysses". Searching PDF content was the one feature I needed to find a replacement for.

Luckily it's a pretty easy fix: "Spotlight" can do it. In the Spotlight preferences I unchecked everything except "PDF", so it only shows PDF-related results.

It works really well because you get the real-time preview too, which I don't think even an Alfred workflow can do.

dunkaroo · November 29, 2016

If you really want to create your own workflow, look into the "mdfind" command. It is the command-line interface to Spotlight, so you can run Spotlight searches in the terminal.

>mdfind kind:pdf "QUERY"
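A workflow script would probably call mdfind rather than have you type it. Here is a minimal Python 3 sketch of that, assuming macOS and passing the whole Spotlight-style query as one argument. The function names build_mdfind_command and search_pdfs are made up for illustration; -onlyin is mdfind's option for limiting the search to one folder.

```python
import subprocess


def build_mdfind_command(query, folder=None):
    """Return the argument list for an mdfind search limited to PDFs."""
    command = ["mdfind"]
    if folder is not None:
        # -onlyin restricts the search to one directory tree.
        command += ["-onlyin", folder]
    # Pass the whole Spotlight-style query as a single argument.
    command.append("kind:pdf " + query)
    return command


def search_pdfs(query, folder=None):
    """Run mdfind (macOS only) and return the matching file paths."""
    result = subprocess.run(
        build_mdfind_command(query, folder),
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Calling search_pdfs("QUERY", "/path/to/pdfs") would then return one path per matching PDF, which a script filter could turn into results.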

dfay · November 29, 2016

Or if you don't want to reinvent the wheel you can use a file filter in Alfred: https://www.alfredapp.com/help/workflows/inputs/file-filter/

mbigras · November 29, 2016

Thanks for the responses. But how would I search specifically for the bookmarks, not just the filename?

Ideally, I'd like to be able to just keep all my pdfs in one folder and be able to search through all the bookmarks.

dfay · November 30, 2016

I just looked at the metadata on an existing PDF (using mdls, the command-line program to access Spotlight metadata), then added a bookmark to that PDF in Preview, and looked at the metadata again.

The bad news is that Preview isn't saving the bookmarks as metadata; they're encoded directly in the PDF. So you would need to use a tool like pdftk to read the bookmarks, but you don't want to do that for each file each time you search, so you'd need to cache the results.
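To make the caching idea concrete, here is a rough Python 3 sketch. It assumes pdftk is installed and that its dump_data output lists each bookmark as a BookmarkBegin line followed by BookmarkTitle, BookmarkLevel and BookmarkPageNumber lines. The JSON cache layout and both function names are hypothetical.

```python
import json
import os
import subprocess


def parse_bookmarks(dump_text):
    """Extract bookmarks from the text printed by `pdftk FILE dump_data`."""
    bookmarks = []
    for line in dump_text.splitlines():
        key, _, value = line.partition(": ")
        if line.strip() == "BookmarkBegin":
            bookmarks.append({})
        elif key == "BookmarkTitle" and bookmarks:
            bookmarks[-1]["title"] = value
        elif key == "BookmarkLevel" and bookmarks:
            bookmarks[-1]["level"] = int(value)
        elif key == "BookmarkPageNumber" and bookmarks:
            bookmarks[-1]["page"] = int(value)
    return bookmarks


def cached_bookmarks(pdf_path, cache_path):
    """Return bookmarks for pdf_path, re-running pdftk only if the file changed."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as handle:
            cache = json.load(handle)
    mtime = os.path.getmtime(pdf_path)
    entry = cache.get(pdf_path)
    if entry is None or entry["mtime"] != mtime:
        # Cache miss or stale entry: ask pdftk again (assumed to be on PATH).
        dump = subprocess.run(
            ["pdftk", pdf_path, "dump_data"],
            capture_output=True, text=True, check=True,
        ).stdout
        entry = {"mtime": mtime, "bookmarks": parse_bookmarks(dump)}
        cache[pdf_path] = entry
        with open(cache_path, "w") as handle:
            json.dump(cache, handle)
    return entry["bookmarks"]
```

Keying the cache on the file's modification time means an edited PDF is re-read automatically, while untouched files cost only a stat call per search.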

The other option is to use Skim and have it save separate skim notes files - this won't save bookmarks per se but it will create searchable files with all your annotations -- see http://apple.stackexchange.com/questions/41013/is-there-a-way-spotlight-indexes-my-skim-bookmarks-notes. (There's also a preference in Skim to make it create the Skim notes files by default).

Having said that I've got a library of +/- 8000 PDFs and in 12 years with Skim I don't think I've ever searched my annotations directly -- they all end up in a BibDesk database and I search the accompanying bibliographical records instead. BibDesk allows the option to search within Skim notes (same developers), which works great on the rare occasions when I need it.

The pain with Skim is that if you're round-tripping to iOS you need to regularly convert Skim notes to standard PDF annotations and vice versa. (This actually has got me thinking about dropping Skim, but there's nothing else I've found that is as customizable or scriptable, even though @smarg19's Skimmer workflow hasn't been updated in ages.) If someone could write a cross-platform PDF editor that has a URL scheme that allows linking and indexing of annotations on iOS and macOS, I'd buy it in a second.

Following-up - it looks like the developers of Skim and BibDesk have ruled out searching standard (i.e. non-Skim) PDF annotations precisely because they are not included in metadata - see https://sourceforge.net/p/bibdesk/feature-requests/760/

Edited December 19, 2016 by dfay
typo

dfay · November 30, 2016

You can create a workflow to search Skim notes and open the corresponding PDF from Alfred quite easily...just tried it:

1) create a file filter for Skim notes (see link above on creating file filters)

2) connect 1) to a Replace action & replace skim with pdf

3) connect 2) to a File Open action

Here: https://www.dropbox.com/s/yjeurvlbhhd1aum/Simple Skim Notes Search.alfredworkflow?dl=0
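If you'd rather do step 2 in a script than with a Replace action, something like this Python sketch would work. It assumes the notes file ends in .skim and sits next to a PDF with the same base name; pdf_for_skim_notes is a made-up name.

```python
from pathlib import Path


def pdf_for_skim_notes(notes_path):
    """Map a Skim notes file (.skim) to the PDF assumed to sit beside it."""
    path = Path(notes_path)
    if path.suffix != ".skim":
        raise ValueError("not a Skim notes file: %s" % notes_path)
    # Only the extension changes, so "skim" elsewhere in the path is untouched.
    return str(path.with_suffix(".pdf"))
```

Changing only the suffix avoids mangling paths that happen to contain "skim" in a folder or file name, which a plain text replace could do.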

2016-12-11 edit: you will probably also want to add kMDItemTextContent in the Advanced tab so that Alfred is searching the full text of the Skim notes.

Edited December 12, 2016 by dfay

h2ner · December 11, 2016

Hey mbigras,

Fantastic idea! Over the years we've gotten used to quick Spotlight navigation, and to apps like Alfred that do even more, so I had also thought of this long ago. Having mostly switched to eBooks, my library has grown, and I've never yet found PDFs to be as convenient as print. Try finding and re-finding your place in a thousand-page PDF; there remains much to be done with navigation.

I'm a newbie to Alfred, perhaps one week in and going strong, and having realized I could build this myself, I've been hacking away at it. My AppleScript is somewhat OK; my Python less so. I initially had a decent version of "filter sections of the current PDF" and "search all" implemented with SQLAlchemy, using jpdfbookmarks (the best tool I've found so far) to get the PDF metadata. Over months and years, the data could get pretty large. Not being too happy with MySQL full-text search, and thinking about performance and eventually sharing the workflow, I'm trying a migration to Peewee and SQLite. It seems OK so far, and I'm currently figuring out full-text search across all PDF sections, ordered by relevance. When it was working before, it was simply amazing to be able to jump anywhere in an instant.

It could take a bit of time to get everything shareable. As I'm not too great with Python or some of the necessary SQL, it'd be great to have someone look at it so we can get a decent initial version, say with a final DB schema so we don't have to worry about migrations. Having more people willing to beta test and wipe their data as we make changes would also help greatly. Eventually, as the dataset could be huge, I'd really like the option to use PostgreSQL full-text search, and perhaps to make this into a complete PDF workflow that could include, for instance, rating, tagging, writing title/author metadata, reading lists, etc., as mentioned in another thread.

Edited December 11, 2016 by h2ner

dfay · December 12, 2016

Seems like a lot of work to index all your metadata, build your own database parallel to Spotlight, and maintain the index and links to files.

dfay · December 12, 2016

Here is a counterpart to the above workflow, to search the Skim notes of the currently open file.

https://dl.dropboxusercontent.com/u/6601556/Alfred/Search Skim Notes (Active Document).alfredworkflow

Skim's bookmarks are by page rather than by text, so they are not searchable directly but it would be easy to designate a note type and style that you only use for bookmarking (e.g. red highlights) and then limit the search accordingly to make a version that would just search them.

h2ner · December 12, 2016

3 hours ago, dfay said:

Seems like a lot of work to index all your metadata, build your own database parallel to Spotlight, and maintain the index and links to files.

It's not so bad. I've had the same wish as mbigras to search and navigate PDFs by section, and his idea of searching all sections of all PDFs in one command is amazing, so I started working on it right away upon realizing it's possible with a workflow. There's nothing else like it yet, and for those, academics or otherwise, who try to read 1000-page books or books with a TOC of 1000 entries, it's quite necessary. Being able to search by TOC entry and then jump into any of thousands of PDFs, to the exact page: who wouldn't find that useful? Most of the basics are already done; I just have to figure out full-text search since I switched from SQLAlchemy and MySQL to Peewee and SQLite. As the DB grows to, say, hundreds of thousands of indexed TOC entries, searching and ordering by relevance becomes more important. I just need to figure that out, fix things up a bit, and see if any other initial changes are needed before it's ready for everyday use; hopefully that's enough to then let others take a look. Indexing Skim notes, self-added bookmarks, or PDF annotations might come later.
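For anyone curious what relevance-ordered search in SQLite can look like, here is a small sketch using Python's built-in sqlite3 module. It assumes the SQLite library behind Python was compiled with the FTS5 extension, which is not guaranteed everywhere. The table layout and function names are invented for illustration and are not the workflow's actual schema.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 index from (title, file, page) rows."""
    db = sqlite3.connect(":memory:")
    # page is stored alongside each entry but is not searchable.
    db.execute(
        "CREATE VIRTUAL TABLE bookmarks USING fts5(title, file, page UNINDEXED)"
    )
    db.executemany("INSERT INTO bookmarks VALUES (?, ?, ?)", rows)
    return db


def search(db, query, limit=20):
    """Return (title, file, page) matches, best match first."""
    # FTS5's hidden "rank" column sorts by its built-in relevance score.
    return db.execute(
        "SELECT title, file, page FROM bookmarks "
        "WHERE bookmarks MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

Raw user input containing quotes or other FTS5 query syntax can raise an error in MATCH, so a real workflow would want to sanitize or quote the query first.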

Edited December 12, 2016 by h2ner

h2ner · December 14, 2016

Here's the repo for anyone that wants to take a look:

https://bitbucket.org/gennaios/alfred-pdfbookmarks

Should I start a new thread at this point? Consider it alpha or beta. General use is sort of OK for searching within the current PDF. I was able to get search across all PDFs working the other day with Peewee, so far only on PostgreSQL, configurable with the environment variable PDF_DATABASE_ENGINE=postgresql. After about two weeks of learning and trying to get SQLAlchemy working (search current, then search all, with MySQL), then moving to Peewee with SQLite and then Postgres, I think I'm going to switch back to SQLAlchemy and devote my effort there. In recent days, after indexing 80 PDFs, the bookmarks DB is at 35,000 entries. mbigras's idea of searching all PDFs and going immediately to any page of any PDF is amazing, so I'm concentrating on getting that working as well as possible, with the idea that eventually search should work well and give good results across hundreds of thousands or millions of bookmarks. Postgres and SQLAlchemy look to be the way; maybe I would have figured that out already if I had more Python and SQL experience or had spent more time looking at requirements and planning. Whatever happens, I'll try to maintain SQLite search of the current PDF with optional PostgreSQL for search all, at least in the beginning stages. I'd just like to have it working well so at least I can use it and read my PDFs; then more time can go towards getting to a 1.0 release. There are various issues, and many details each take a lot of effort for someone new to Python. Maybe others might take a look and help out. Later, it'll get moved to GitHub.

Edited December 14, 2016 by h2ner

h2ner · December 20, 2016

I've made some decent progress with searching all PDF bookmarks. Ranked full-text search in PostgreSQL through Peewee is super fast. I'll look at SQLite sometime so the script can get closer to being published. For filtering bookmarks of the current PDF, AppleScript is used to get the open PDF from Acrobat, Skim, or Preview. I was using Alfred's result filtering for the full list of PDF bookmarks, but switched to an SQL query so I can search the entire hierarchy of a section ("Formatting Method Basics" within chapter "String Fundamentals > String Formatting Expressions"). Is there a way to set up the script filter so it runs the AppleScript to get the opened PDF only once? There is an example of reusing script filters (link down) that seems to write info to a file. That seems like it could work, and then I could remove the file at the next step, but what happens if someone presses Escape and the next step doesn't run? Is there another way?
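One possible way around the Escape problem, sketched here in plain Python rather than with any Alfred-specific API, is to let the cache file expire by age instead of deleting it in the next step. The load_cached helper and the 30-second lifetime in the usage note are assumptions for illustration; compute stands in for whatever runs the AppleScript.

```python
import json
import time


def load_cached(cache_path, max_age, compute, now=None):
    """Return the cached value if younger than max_age seconds, else recompute."""
    now = time.time() if now is None else now
    try:
        with open(cache_path) as handle:
            entry = json.load(handle)
        if now - entry["saved_at"] <= max_age:
            return entry["value"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing, unreadable or malformed cache: recompute below
    value = compute()
    with open(cache_path, "w") as handle:
        json.dump({"saved_at": now, "value": value}, handle)
    return value
```

With something like load_cached(path, 30, get_open_pdf), the expensive lookup runs at most once per 30 seconds while the query is being typed, and a file left behind after Escape simply goes stale on its own.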

mbigras · February 20, 2017

Hi @h2ner!

It sounds like you've made some progress. Right on!

Any thoughts on putting together a demo screencast and installation instructions?

How has it been working for the last couple months?

I looked on packal but wasn't able to find it.

Have you thought about publishing it?

h2ner · February 20, 2017

Hello mbigras,

Yes, progress has been decent. It's now renamed to alfred-gnosis and is on GitHub:

https://github.com/gennaios/alfred-gnosis

It looks the same as in the above screenshot, except for search all, which shows "section | section hierarchy" in the title and the book file in the subtitle. In the past week or so, I've added ePub support, with a command-line option to recursively import a folder. It's ePub 2 only at the moment. calibre recently added a command-line option to go to a bookmark in an ePub, and it works by bookmark title. Not quite precise, but a good start.

There are hotkeys bound which might not suit everyone:

cmd-G - filter bookmarks of the current book in Skim, Preview, Acrobat, or calibre viewer

control-D - filter all bookmarks

cmd-U - remove bookmarks; this should really be an update (remove and then reimport, in case the PDF/ePub was edited), so for now run cmd-G again afterwards to import (might be broken)

cmd-E - edit ePub (currently hardcoded to BBEdit)

I've been refactoring in the last week and part of it might be partially broken. If so, possibly only the part that finds the ePub NCX to parse the TOC. I've started to add tests with py.test, as numerous times I've broken something trying to rewrite or add functionalities.

For setup, I haven't quite got the hang of how to include dependencies (e.g., in ./lib/) and use them. peewee, python-epub and lxml are among the requirements. Running the script from the command line should complain about any missing libraries: python gnosis -e gets bookmarks for the current ePub in calibre (by reading the first entry in the recent files in the viewer settings), … -i ~/folder imports ePubs, and there is also … -g file.pdf.

I haven't looked at SQLite support in a while. Part of it was not being familiar enough with Python or Peewee, at the time I began, to switch between one database and another at runtime. Part of it was other details, including shipping an updated SQLite for FTS5 full-text search (for search all) and working out how to use that binary instead of the one included with macOS. I might be able to figure that out now, though I haven't looked at it. For the most part, running the workflow with various command-line options should show which libraries are needed (jpdfbookmarks as well); besides that, it needs PostgreSQL with a user postgres that has create-table permissions. I should include jpdfbookmarks, but I was trying to keep the git repo small and might switch to a faster PDF bookmark parser. I tried pdfminer recently and it looks to work well; I only need to rewrite the code that works out the searchable bookmark section (e.g., section 1 > chapter 1. title). The code for that is ugly anyway. I think that's about it for setup. Perhaps it wouldn't take too much to get it ready to release, though I've been working on other things for a while and only got back to this in the last week or so.
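The "searchable bookmark section" step can be fairly small. Below is a hedged Python sketch, not the code in the repo, that assumes the parser yields (level, title) pairs in document order with level 1 as the top of the outline. section_paths is a hypothetical name.

```python
def section_paths(bookmarks, separator=" > "):
    """Turn (level, title) pairs in document order into full section paths."""
    stack = []  # titles of the current ancestors, outermost first
    paths = []
    for level, title in bookmarks:
        # Drop entries at the same or a deeper level than this bookmark.
        del stack[level - 1:]
        stack.append(title)
        paths.append(separator.join(stack))
    return paths
```

Each resulting path, for example "String Fundamentals > String Formatting Expressions > Formatting Method Basics", can be stored next to the bookmark so a single text query matches the section and its parent chapters.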

Set DB_ENGINE=postgresql in the workflow environment variables, though perhaps that's in a .plist and already saved in git. It may also be necessary to manually create some Postgres indices.
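Reading that variable at runtime could look roughly like the following Python sketch. The fallback to sqlite when DB_ENGINE is unset and the list of accepted values are assumptions made for illustration, not necessarily what the workflow does.

```python
import os

# Assumed set of engines; only these two are mentioned in the thread.
SUPPORTED_ENGINES = ("sqlite", "postgresql")


def database_engine(environ=None):
    """Pick the database engine named by DB_ENGINE, defaulting to sqlite."""
    environ = os.environ if environ is None else environ
    engine = environ.get("DB_ENGINE", "sqlite").strip().lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError("unsupported DB_ENGINE: %r" % engine)
    return engine
```

Failing loudly on an unknown value makes a typo in the workflow variables show up immediately instead of silently selecting the wrong database.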

As my Python and even overall programming experience is limited, any contributions would be fantastic!

An example of search all for ePub (I need to change the workflow icon):

Edited February 20, 2017 by h2ner

h2ner · February 27, 2017

Progress continues, and for a while I was almost ready to post a version for testing. It was a long way getting here: parsing PDF and ePub bookmarks, exploring ways and libraries to store them while imagining future additions, SQL full-text search across bookmark title, section, and file name, etc. Quite a bit of work has gone into it. Dependencies should be mostly added, and initial SQLite support should be there so it can just work without setup. Having explored DB migrations, I'm not sure how much I can support them later, or how difficult it will be, so I'm trying to finalize the schema for future uses beyond bookmarks: for instance, browsing books by author, publisher, tags, ratings, etc., auto-reimporting bookmarks if the file's modification date changes, and who knows what else. I'm not sure who else is interested; perhaps more will show themselves once they start using it and, if they use ebooks regularly, discover the power of being able to search all books by chapter/section title and go there instantly. Cloning the git repo and using deanishe's scripts to symlink or install, while also changing the git-stored environment variable for the DB from postgresql to sqlite, should make it work. Recent DB changes should be pushed soon; anyone who wants to test should be ready to migrate manually if needed.
