
Problem with packaged Python libraries



Hi! I've put together a workflow to interact with Obsidian. It relies heavily on some Python libraries like NLTK and Gensim, and I've done my best to package them, following advice from @deanishe found elsewhere on this forum.

 

However, I get an error when I try to run it on a different computer:


```
import regex._regex as _regex
ModuleNotFoundError: No module named 'regex._regex'

```

 

From what I've gathered, that may have something to do with the fact that Regex (required by NLTK) ships a precompiled binary that may not play well with the other system or its Python interpreter.
 

All my scripts have a shebang pointing to `/usr/bin/python3` and I'm comfortable with it as a minimum requirement, but is there any hope of properly packaging Regex with the workflow?

Edited by Miguel Tavares
1 hour ago, Miguel Tavares said:

From what I've gathered, that may have something to do with the fact that Regex (required by NLTK) ships a precompiled binary that may not play well with the other system or its Python interpreter.

 

This. The name of the binary is "_regex.cpython-37m-darwin.so", which means it's only compatible with Python 3.7. Same goes for all of scipy, numpy and gensim.
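You can see the exact suffix your interpreter expects for compiled extensions with a one-liner; if it doesn't match the `.so` files bundled in the workflow, the import fails exactly like this:

```python
import sysconfig

# The suffix every compiled extension module must carry for this interpreter,
# e.g. ".cpython-38-darwin.so". A "_regex.cpython-37m-darwin.so" built for
# Python 3.7 simply won't be found by any other version.
print(sysconfig.get_config_var("EXT_SUFFIX"))
```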

Edited by deanishe

Short term, you could make sure you install the workflow's dependencies with Python 3.8, which is what Catalina and Big Sur ship, but that likely won't work with the next version of macOS. And I presume you personally need 3.7 for some reason (3.7 was never part of macOS, AFAIK).
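For what it's worth, the usual pattern is to run `/usr/bin/python3 -m pip install --target lib <packages>` so the compiled wheels match that interpreter's ABI, then have each script put that folder on the path. A minimal sketch (the `lib` folder name is just a convention):

```python
import os
import sys

# Prepend the bundled "lib" folder (populated earlier with
# `/usr/bin/python3 -m pip install --target lib ...`) so the workflow's
# scripts import the dependencies shipped with it, not system packages.
LIB_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "lib")
sys.path.insert(0, LIB_DIR)
```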

 

Personally, I'd rethink the workflow. It's 300MB, which is nuts. I have over 250 workflows, and yours is larger than all of them combined.

 

What are you doing with gensim and NLTK? Couldn't you use sqlite (included with Python) for fulltext search, instead?


You're right. The workflow is bigger than Obsidian itself.

 

There's a "Related Notes" feature that looks for similar text files. It initially used the simpler Jaccard similarity algorithm, but then I thought "Hey, let's do it properly with TF-IDF". So I imported NLTK for word tokenisation and Gensim for vectorisation. It worked fine (on my computer), but it all went sideways everywhere else.

 

I don't know a thing about sqlite (or even what role a database would have in this feature), so I'll probably just go back to Jaccard and try to keep things simple.
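For reference, Jaccard similarity on word sets needs nothing beyond the standard library, which is exactly why it packages so well. A minimal sketch:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two documents' word sets (naive whitespace split)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not (wa or wb):
        return 0.0
    return len(wa & wb) / len(wa | wb)

print(jaccard("the cat sat", "the cat ran"))  # 2 shared / 4 total = 0.5
```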

 

Thanks for your help and advice.

5 hours ago, Miguel Tavares said:

I don't know a thing about sqlite (or even what role a database would have in this feature)

 

It has fairly advanced fulltext search capabilities, including Porter stemming, and is extremely fast. I have no idea if it's any use for your "related documents" feature, though. As I said, I don't really understand what you're doing with gensim and NLTK.
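A minimal sketch of what that looks like from Python, assuming your SQLite build includes FTS5 (the system builds on recent macOS do):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# An FTS5 virtual table using the built-in Porter stemmer: "running" and
# "runs" both index as "run", so stemmed queries match either form.
con.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body, tokenize='porter')")
con.execute("INSERT INTO notes VALUES ('daily log', 'went running in the park')")
rows = con.execute("SELECT title FROM notes WHERE notes MATCH 'run'").fetchall()
print(rows)  # -> [('daily log',)] — the stemmed index matches 'running'
```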

2 minutes ago, deanishe said:

 

I have no idea if it's any use for your "related documents" feature.

It probably is. If it has Porter stemming (something I'd never have imagined), I assume that stop-word removal will be trivial. Then I can just feed a full-text search with the "input document" and see what comes out. Right?

3 minutes ago, Miguel Tavares said:

I assume that stop word removal will be trivial

 

Pretty sure it doesn't support stop words out of the box. You might need to write your own custom tokeniser for that. Here are the full-text search docs: https://www.sqlite.org/fts5.html
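One workaround, a sketch rather than the proper custom-tokeniser route: strip stop words from the text yourself before inserting it into the FTS table (the word list here is a tiny illustrative stand-in):

```python
# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def strip_stop_words(text: str) -> str:
    """Drop common stop words before handing the text to the indexer."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(strip_stop_words("The cat in the hat"))  # -> "cat hat"
```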

 

I usually use a custom ranking function to apply different weightings to different columns (e.g. title vs tags vs body).
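Concretely, FTS5's built-in `bm25()` takes one weight per column, and better matches score numerically lower, so you sort ascending. A sketch with made-up data (the filler rows keep the term rare enough for a sensible idf):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE notes USING fts5(title, tags, body)")
con.executemany("INSERT INTO notes VALUES (?, ?, ?)", [
    ("python tips", "", "misc notes"),    # term in the title
    ("journal", "python", "more notes"),  # term only in the tags
    ("groceries", "", "milk and eggs"),   # filler, no match
    ("todo", "", "call the plumber"),     # filler, no match
    ("archive", "", "old stuff"),         # filler, no match
])
# Weight title hits 10x, tag hits 5x, body hits 1x; bm25() returns lower
# (more negative) values for better matches, hence ORDER BY ascending.
ranked = con.execute(
    "SELECT title FROM notes WHERE notes MATCH 'python' "
    "ORDER BY bm25(notes, 10.0, 5.0, 1.0)"
).fetchall()
print(ranked)  # the title hit should rank above the tag hit
```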

