zeitlings Posted August 19, 2023 (edited)

Extract Keywords

Extract keywords and keyphrases from articles, books, or any other document with YAKE!

Usage

- Send PDF, docx, doc, rtf, or txt documents to the workflow's File Actions
- Pass the text of your macOS selection on to the workflow's Universal Action
- Use the keyword and paste your text (default: kw)

The extracted keywords are presented in a dialog.

Dependencies

The workflow relies on Python 3 to install the YAKE standalone.

- YAKE!: pip install git+https://github.com/LIAAD/yake (see the official installation guide)
- pdftotext: brew install poppler (formula on brew.sh)

Stopwords

YAKE has internal stopword handling that cannot be influenced from the command line. However, you can still define a list of words that will be purged outright from the input text. To set up a "purge word" list, create a text file named after the language identifier of the corresponding language in the workflow's root folder, e.g. assets/stopwords/de.txt. The workflow checks whether the file exists and, if it does, removes those words. The purge-word files can be quickly accessed through Alfred by prefixing the keyword with a colon (default: :kw).

YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical features extracted from single documents to select the most important keywords of a text.
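If you want a feel for what the extractor does under the hood, here is a minimal sketch of calling YAKE directly from Python. The parameter values are illustrative assumptions, not the workflow's actual settings:

```python
import yake

text = "Paste or load the document text here."

# Illustrative settings; the workflow's own defaults may differ.
extractor = yake.KeywordExtractor(
    lan="en",  # language identifier, as with the purge-word files
    n=3,       # maximum n-gram length of a keyphrase
    top=10,    # number of keyphrases to return
)

# YAKE scores are ascending: a lower score means a more relevant phrase.
for phrase, score in extractor.extract_keywords(text):
    print(f"{score:.4f}  {phrase}")
```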
TomBenz Posted September 28, 2023

@zeitlings, please consider adding an option to extract key sentences (the sentences where keyphrases are located). It would be great if key sentences in a docx file were not only extracted but also highlighted in yellow, i.e. an "extract and highlight key sentences" option. See the sample Python code below. You could add such an option or advise on how to incorporate it.

```python
# Code to highlight key sentences using keyphrases extracted with NLP.
# Requires the textract, rake_nltk, python-docx, and nltk libraries.
import nltk
import textract
from rake_nltk import Rake
from tkinter import filedialog as fd
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import tkinter as tk
import os.path

# nltk.download('stopwords')
# nltk.download('punkt')

filename = fd.askopenfilename()
text = textract.process(filename).decode("utf-8")

# min_length and max_length control the phrase size in words;
# include_repeated_phrases keeps or removes repeated phrases.
rake_nltk_var = Rake(
    min_length=2,
    max_length=4,
    include_repeated_phrases=False,
    stopwords={
        'yet', 'would', 'words', 'wise', 'whether', 'wherein', 'went', 'you',
        'what', 'usage', 'ultimately', 'to', 'there', 'then', 'that', 'still',
        'site', 'when', 'we', 'wants', 'vertical', 'vary', 'try', 'through',
        'this', 'so', 'something', 'see', 'shift', 'will', 'said', 'require',
        'say', 'or', 'now', 'no', 'much', 'move', 'me', 'part', 'post',
        'related', 's', 'value', 're', 'of', 'it', 'is', 'instead',
        'initially', 'let', 'in', 'if', 'i', 'only', 'pick', 'yours',
        'within', 'which', 'used', 'use', 'tried', 'those', 'the', 'taken',
        'take', 'shows', 'however', 'similar', 'types', 'how', 'work', 'with',
        'where', 'way', 'wanted', 'uses', 'us', 'towards', 'typical', 'show',
        'same', 'requires', 'remember', 'referred', 'read', 'question',
        'volume', 'volumes', 'one', 'two', 'thing', 'things', 'some',
        'overview', 'over', 'other', 'various', 'them', 'on', 'off',
    },
)

rake_nltk_var.extract_keywords_from_text(text)
# Change the slice to increase or decrease the number of keyphrases selected.
keyphrases_extracted = rake_nltk_var.get_ranked_phrases()[:50]
print("Key_Phrases: ")
print(keyphrases_extracted)
print("\n")
print("No. of keyphrases extracted:", str(len(keyphrases_extracted)))

from nltk.tokenize import sent_tokenize

search_words = keyphrases_extracted
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)

print("Extracted key sentences: ")
print(matches)
print("\n")
print("No. of key sentences extracted: ", str(len(matches)))

doc = Document(filename)
for para in doc.paragraphs:
    for items in matches:
        start = para.text.find(items)
        if start > -1:
            pre = para.text[:start]
            post = para.text[start + len(items):]
            para.text = pre
            para.add_run(items)
            para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
            para.add_run(post)

# Save the output in a new doc file at the selected file's location.
root, extension = os.path.splitext(filename)
output_filename = root + "_KeySent_Highlights_C1" + extension
doc.save(output_filename)
```
zeitlings Posted September 29, 2023 (Author)

Hey @TomBenz, that sounds like a job for a different workflow, and somewhat niche. If you want to adapt the workflow to do that, I'd start with passing (a) the text file location $loc and (b) the keywords, i.e. the query, as $1 via zsh argv to a "Run Script" object that runs your Python script. The script should look something like this (not at all tested):

```python
import os.path
import sys

from docx import Document
from docx.enum.text import WD_COLOR_INDEX
from nltk.tokenize import sent_tokenize

filename = sys.argv[1]
keywords = sys.argv[2]

search_words = keywords.splitlines()
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)

doc = Document(filename)
for para in doc.paragraphs:
    for items in matches:
        start = para.text.find(items)
        if start > -1:
            pre = para.text[:start]
            post = para.text[start + len(items):]
            para.text = pre
            para.add_run(items)
            para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
            para.add_run(post)

# Save the output in a new doc file at the selected file's location
root, extension = os.path.splitext(filename)
output_filename = root + "_KeySent_Highlights_C1" + extension
doc.save(output_filename)

sys.stdout.write(output_filename)  # e.g. to reveal the file with Alfred
```
TomBenz Posted September 30, 2023

15 hours ago, zeitlings said:
> Hey @TomBenz, that sounds like a job for a different workflow, and somewhat niche. If you want to adapt the workflow to do that, I'd start with passing (a) the text file location $loc and (b) the keywords, i.e. the query, as $1 via zsh argv to a "Run Script" object that runs your Python script. The script should look something like this (not at all tested): […]

@zeitlings thanks for your help. "I'd start with passing (a) the text file location $loc and (b) the keywords, i.e. the query as $1 for zsh argv" — how to do this exactly? Is it possible to post a separate workflow for it? I have tried, but it is not clear to me how to pass $loc and the keywords from the earlier workflow into my Python code. I get two errors:

1.
```
line 8, in <module>
    keywords = sys.argv[2]
IndexError: list index out of range
```

2. Upon hardcoding the keywords for testing, I get:
```
line 12, in <module>
    sentences = sent_tokenize(text)
NameError: name 'text' is not defined
```

I have only a basic understanding of Python, so I am trying to learn and make this work. Thanks in advance for your help.
zeitlings Posted September 30, 2023 (Author, edited)

8 hours ago, TomBenz said:
> I'd start with passing (a) the text file location $loc and (b) the keywords, i.e. the query as $1 for zsh argv — how to do this exactly? Is it possible to post a separate workflow for it?

You pass the variables on to a script, for example, like so:

```zsh
./scripts/highlight.py "$loc" "$1" "$docx"
```

By the way, I found that the example you posted is quite flawed: only the last sentence of a paragraph containing a keyphrase ends up highlighted, because each loop iteration erases the work done by the previous ones. With this fixed, every sentence containing a keyword is highlighted; the downside is that you are then likely to end up with large chunks of highlighted text. At that point the script forfeits its usefulness, since the desired result is a document with highlighted key sentences. One approach would be to drastically lower the keyword count... or to process really large docx files. Finding out which sentences are truly important would require more NLP, though.

Anyway, I went down the rabbit hole for you: Alfred-Workflow. You will want to play with "highlight.py" a bit. There are three methods you can test:

- highlight_keywords to highlight all the keywords.
- highlight_sentences to highlight all sentences containing a keyword.
- highlight_original to highlight in the flawed fashion described above.

Sure made for an interesting Saturday... 😅
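For reference, here is a minimal sketch of the idea behind the fix; it is an illustration, not the actual highlight.py from the workflow. Instead of rewriting para.text inside the loop (which erases earlier runs), it collects all match spans first, merges overlaps, and rebuilds the paragraph's runs in one pass:

```python
from docx.enum.text import WD_COLOR_INDEX

def highlight_matches(para, matches):
    """Highlight every occurrence of every match string in one pass."""
    text = para.text
    spans = []
    for m in matches:
        start = text.find(m)
        while start > -1:
            spans.append((start, start + len(m)))
            start = text.find(m, start + len(m))
    if not spans:
        return
    spans.sort()
    # Merge overlapping spans so the highlighted runs never overlap.
    merged = [spans[0]]
    for s, e in spans[1:]:
        if s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    # Rebuild the paragraph: plain run, highlighted run, plain run, ...
    # Note: this discards the paragraph's original character formatting.
    para.clear()
    pos = 0
    for s, e in merged:
        if pos < s:
            para.add_run(text[pos:s])
        run = para.add_run(text[s:e])
        run.font.highlight_color = WD_COLOR_INDEX.YELLOW
        pos = e
    if pos < len(text):
        para.add_run(text[pos:])

# Usage, with "doc" and "matches" as in the scripts above:
# for para in doc.paragraphs:
#     highlight_matches(para, matches)
```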
TomBenz Posted October 3, 2023

On 9/30/2023 at 8:34 PM, zeitlings said:
> You pass the variables on to a script, for example, like so: ./scripts/highlight.py "$loc" "$1" "$docx" […] Anyway, I went down the rabbit hole for you: Alfred-Workflow. You will want to play with "highlight.py" a bit. […]

Many, many thanks @zeitlings for your inputs and time on this. I will experiment with highlight.py and revert in a few days.