
Extract Keywords

Extract keywords and keyphrases from articles, books or any other document with YAKE!

 

Download on GitHub

 

 

Usage

  • Send PDF, docx, doc, rtf, or txt documents to the workflow’s File Actions
  • Pass the text selected in macOS on to the workflow’s Universal Action
  • Use the keyword and paste your text (default: kw)


The extracted keywords are presented in a dialog.


Dependencies

The workflow relies on Python 3 to install the YAKE standalone.

  • YAKE!
  • pdftotext

Stopwords

 


YAKE has internal stopword handling that cannot be influenced from the command line. However, you can still define a list of words that will be purged outright from the input text. To set up such a purge-word list, create a text file in the workflow root folder, named after the identifier of the corresponding language, e.g. for German: assets/stopwords/de.txt.


The workflow checks whether the file exists and, if it does, removes the listed words from the input text before extraction.


The purge-word files can be accessed quickly through Alfred by prefixing the keyword with a colon (default: :kw).
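
This is not the workflow’s actual implementation, but a minimal sketch of what such a purge could look like, assuming one word per line in the purge file (the function name and paths are illustrative):

import os.path
import re

def purge_words(text, lang, workflow_root="."):
    # Look for a purge list such as assets/stopwords/de.txt
    path = os.path.join(workflow_root, "assets", "stopwords", lang + ".txt")
    if not os.path.exists(path):
        return text  # no purge list for this language
    with open(path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    if not words:
        return text
    # Remove every listed word as a whole word, case-insensitively
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)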





YAKE! is a lightweight unsupervised automatic keyword extraction method that relies on statistical text features extracted from single documents to select the most important keywords of a text.
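
For reference, the same extractor is also available from Python as the yake package on PyPI. A quick sketch (the workflow itself invokes the YAKE standalone, so this is only for illustration):

import yake

text = "Extract keywords and keyphrases from articles, books or any other document with YAKE!"

# n is the maximum n-gram size, top the number of keyphrases to return
kw_extractor = yake.KeywordExtractor(lan="en", n=3, top=10)

# Yields (keyphrase, score) pairs; lower scores indicate more relevant phrases
for kw in kw_extractor.extract_keywords(text):
    print(kw)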

  • 1 month later...

@zeitlings, a request to consider: could you add an option to extract key sentences (sentences where key phrases are located)? It would be great if the key sentences in a docx file were not only extracted but also highlighted in yellow, i.e. an "extract and highlight key sentences" option.

 

See below for sample code in Python. You could add such an option, or give some guidance on how to incorporate this.

 

# Code to highlight key sentences using keyphrases extracted via NLP
# Required third-party libraries: nltk, textract, rake_nltk, python-docx
import nltk
import textract
from rake_nltk import Rake
from tkinter import filedialog as fd
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import tkinter as tk
import os.path

#nltk.download('stopwords')
#nltk.download('punkt')

filename = fd.askopenfilename()
text = textract.process(filename).decode("utf-8")
# min_length / max_length control the phrase size in words;
# include_repeated_phrases keeps or removes repeated phrases
rake_nltk_var = Rake(
    min_length=2,
    max_length=4,
    include_repeated_phrases=False,
    stopwords={
        'yet', 'would', 'words', 'wise', 'whether', 'wherein', 'went', 'you',
        'what', 'usage', 'ultimately', 'to', 'there', 'then', 'that', 'still',
        'site', 'when', 'we', 'wants', 'vertical', 'vary', 'try', 'through',
        'this', 'so', 'something', 'see', 'shift', 'will', 'said', 'require',
        'say', 'or', 'now', 'no', 'much', 'move', 'me', 'part', 'post',
        'related', 's', 'value', 're', 'of', 'it', 'is', 'instead',
        'initially', 'let', 'in', 'if', 'i', 'only', 'pick', 'yours',
        'within', 'which', 'used', 'use', 'tried', 'those', 'the', 'taken',
        'take', 'shows', 'however', 'similar', 'types', 'how', 'work', 'with',
        'where', 'way', 'wanted', 'uses', 'us', 'towards', 'typical', 'show',
        'same', 'requires', 'remember', 'referred', 'read', 'question',
        'volume', 'volumes', 'one', 'two', 'thing', 'things', 'some',
        'overview', 'over', 'other', 'various', 'them', 'on', 'off',
    },
)

rake_nltk_var.extract_keywords_from_text(text)

keyphrases_extracted = rake_nltk_var.get_ranked_phrases()[:50]  # change the slice to select more or fewer keyphrases
print("Key_Phrases: ")
print(keyphrases_extracted)
print("\n")
print("No. of Keyphrases extracted are:", str(len(keyphrases_extracted)))

from nltk.tokenize import sent_tokenize

search_words = keyphrases_extracted
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)
print("Extracted key sentences: ")
print(matches)
print("\n")
print("No. of Key sentences extracted are: ", str(len(matches)))

doc = Document(filename)

for para in doc.paragraphs:
    for items in matches:
        start = para.text.find(items)
        if start > -1:
            pre = para.text[:start]
            post = para.text[start+len(items):]
            para.text = pre
            para.add_run(items)
            para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
            para.add_run(post)

# Save the output in a new doc file at the selected file location
root, extension = os.path.splitext(filename)
output_filename = root + "_KeySent_Highlights_C1" + extension
doc.save(output_filename)

Hey @TomBenz, that sounds like a job for a different workflow and somewhat niche.

If you want to adapt the workflow to do that, I'd start with passing (a) the text file location $loc and (b) the keywords, i.e. the query as $1 for zsh argv, to a “Run Script” object that runs your Python script.

 

The script should look something like this (not at all tested):

 

import os.path
import sys
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
from nltk.tokenize import sent_tokenize

filename = sys.argv[1]
keywords = sys.argv[2]

search_words = keywords.splitlines()
matches = []
sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)

doc = Document(filename)

for para in doc.paragraphs:
    for items in matches:
        start = para.text.find(items)
        if start > -1:
            pre = para.text[:start]
            post = para.text[start+len(items):]
            para.text = pre
            para.add_run(items)
            para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
            para.add_run(post)

# Save the output in new doc file at selected file location
root, extension = os.path.splitext(filename)
output_filename = root + "_KeySent_Highlights_C1" + extension
doc.save(output_filename)
sys.stdout.write(output_filename) # e.g. to reveal the file with Alfred

 

15 hours ago, zeitlings said:

The script should look something like this (not at all tested): [...]


@zeitlings thanks for your help.  

 

I'd start with passing (a) the text file location $loc and (b) the keywords, i.e. the query as $1 for zsh argv

-- How to do this exactly? Is it possible to post a separate workflow for it?

 


 

I have tried, but it is not clear to me how I can pass $loc and the keywords from the earlier workflow into my Python code. I get two errors:

 

1. line 8, in <module>
    keywords = sys.argv[2]
IndexError: list index out of range

 

2. Upon hardcoding the keywords for testing, I get:

line 12, in <module>
    sentences = sent_tokenize(text)
NameError: name 'text' is not defined

 

I have only a basic understanding of Python, so I am trying to learn and make this work. Thanks in advance for your help.

 

 

8 hours ago, TomBenz said:

I'd start with passing (a) the text file location $loc and (b) the keywords, i.e. the query as $1 for zsh argv

-- How to do this exactly? Is it possible to post a separate workflow for it?

 

You pass the variables on to a script, for example, like so:

./scripts/highlight.py "$loc" "$1" "$docx"
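
On the Python side, those arrive positionally, e.g. (variable names are illustrative):

import sys

loc = sys.argv[1]       # $loc  - path to the extracted text file
keywords = sys.argv[2]  # $1    - the query, one keyphrase per line
docx = sys.argv[3]      # $docx - path to the source .docx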

 

Btw., I found that the example you posted is quite flawed. Only the last sentence of a paragraph that contains a keyphrase ends up highlighted, because each loop iteration erases the work done previously. With this fixed, each sentence containing a keyword is highlighted; the downside then is that you are likely to encounter large chunks of highlighted text. At that point the script forfeits its usefulness, since the desired result is a document with highlighted key sentences. One approach would be to drastically lower the keyword count... or to process really large docx files. Finding out which sentences are really important would require more NLP, though.
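
For illustration, a sketch of that fix (not the workflow's actual highlight.py): rebuild each paragraph once, instead of resetting para.text for every match:

from docx.enum.text import WD_COLOR_INDEX

def highlight_matches(doc, matches):
    # Rebuild each paragraph once: plain run, highlighted run, plain run, ...
    for para in doc.paragraphs:
        text = para.text
        found = sorted((text.find(m), m) for m in matches if m in text)
        if not found:
            continue
        para.text = ""  # clear the existing runs once, then rebuild
        pos = 0
        for start, m in found:
            if start < pos:
                continue  # skip matches overlapping an earlier highlight
            para.add_run(text[pos:start])
            para.add_run(m).font.highlight_color = WD_COLOR_INDEX.YELLOW
            pos = start + len(m)
        para.add_run(text[pos:])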

 

Anyway, I went down the rabbit hole for you: Alfred-Workflow.

 

You will want to play with "highlight.py" a bit. There are three methods you can test:

  • highlight_keywords to highlight all the keywords.
  • highlight_sentences to highlight all sentences containing a keyword.
  • highlight_original to highlight in the flawed fashion described above.

 

Sure made for an interesting Saturday... 😅

 

 

On 9/30/2023 at 8:34 PM, zeitlings said:

Anyway, I went down the rabbit hole for you: Alfred-Workflow. You will want to play with "highlight.py" a bit. [...]


Many thanks @zeitlings for your input and time on this. I will experiment with highlight.py and report back in a few days.

