jmm28260

Workflow to identify language in pdf


I am trying to automate language recognition of OCR'd PDFs in order to flag each file accordingly. Would Alfred be able to handle this with a workflow?

Any help welcome. Thanks.

 


Do you want to manually select the files to process or otherwise manually trigger whatever program you're using for language recognition?

Edited by deanishe


The file is handled through an Automator workflow, and after OCR I would like its language to be identified and flagged accordingly to be later dispatched to the appropriate folder.


I was hoping to automate the PDF text language identification. Would AppleScript be able to do it?

Thanks for your time and efforts, anyway.

17 minutes ago, jmm28260 said:

I was hoping to automate the PDF text language identification

 

I understand that. I don't understand which part of that you thought Alfred might be able to do.

 

17 minutes ago, jmm28260 said:

Would AppleScript be able to do it?

 

No. You're trying to process PDFs, so you need to start with something that can actually understand PDF files.

 

If the file has metadata, you can extract the language from that with exiftool, for example. In a shell, that would look like:

exiftool -t -Language /path/to/file.pdf | cut -d$'\t' -f2
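If the PDF carries no Language tag, that pipeline should print nothing at all, so a script built on it probably wants a fallback. A minimal sketch, assuming exiftool is installed; the function name lang_from_exiftool and the "unknown" placeholder are made up for illustration:

```shell
# Read exiftool's tab-separated output (as produced by -t) on stdin and
# print the value column, or "unknown" when there is no value at all.
lang_from_exiftool() {
    value=$(cut -f2 | head -n 1)   # cut splits on tabs by default
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf 'unknown\n'         # hypothetical placeholder for "no tag found"
    fi
}

# Intended use (requires exiftool):
#   exiftool -t -Language /path/to/file.pdf | lang_from_exiftool
```

Keep in mind that this only reports what the PDF's metadata claims; it does not look at the text itself.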
Edited by deanishe


Here is the AppleScript I use:

 

try
    tell application "FineReader"
        activate
        open theFile
        tell application "System Events"
            tell process "FineReader"
                tell menu bar 1
                    tell menu "Document"
                        tell menu item "OCR Text Recognition"
                            tell menu 1
                                click menu item "Recognize Text Using OCR..."
                            end tell
                        end tell
                    end tell
                    
                end tell
                keystroke return
            end tell
        end tell
        save the front document
        close the front document
        close application "FineReader"
    end tell
end try

 

Can I add a few lines to extract the metadata and identify the language?


I have no idea.

 

You're asking questions about FineReader, so you should ask on a FineReader-related forum, where there are people who know the software.

 

 

On 1/1/2019 at 4:21 PM, jmm28260 said:

Here is the AppleScript I use:

 

try
    tell application "FineReader"
        activate
        open theFile
        tell application "System Events"
            tell process "FineReader"
                tell menu bar 1
                    tell menu "Document"
                        tell menu item "OCR Text Recognition"
                            tell menu 1
                                click menu item "Recognize Text Using OCR..."
                            end tell
                        end tell
                    end tell
                    
                end tell
                keystroke return
            end tell
        end tell
        save the front document
        close the front document
        close application "FineReader"
    end tell
end try

 

Please don't nest tell application blocks inside each other unless you need to (which is basically never). It's bad in many ways. Group things so that the commands FineReader handles go only inside a tell application "FineReader" block, the System Events commands go separately inside their own block, and so on.


Anyway, you might not need to use that script.  You said you were doing this in Automator.  Automator has an action specifically for extracting text from a PDF file:

 

[Screenshot: the Automator action for extracting text from a PDF]

 

Once you have plain text, detecting the language with AppleScript is pretty easy. Say you store your string in an AppleScript variable called input_string; the following code will then return the language code for what it believes to be the dominant language of the string (e.g. "en" for English, "de" for German, etc.):

 

use framework "Foundation"

property this : a reference to current application
property NSLinguisticTagger : a reference to NSLinguisticTagger of this

-- input_string must already hold the text to analyse
NSLinguisticTagger's dominantLanguageForString:input_string
result as text


Thank you so much for your time and efforts.

I tried your AppleScript on a PDF file, and it didn't work. I guess I probably messed it up.

What I am trying to do is have the workflow read the page, get the title, and append the result of your AppleScript to it, i.e. Title +EN if the text is in English, or +FR if it is in French, for instance. At the following link you will find the messed up Automator workflow. If you have a minute, can you help? Many thanks.

https://www.dropbox.com/sh/uno2vet0doe7t0b/AAAlwDqqlzkeOMT0b7dNIEOTa?dl=0

13 hours ago, CJK said:

use framework "Foundation"

 

property this : a reference to current application

property NSLinguisticTagger : a reference to NSLinguisticTagger of this

 

NSLinguisticTagger's dominantLanguageForString:input_string

result as text

 

This is extremely cool. I had no idea it's so easy to use macOS's language identification.

6 hours ago, jmm28260 said:

I guess I probably messed it up

 

Yeah, it's a bit of a mess. You keep setting Titre to various file paths rather than the PDF's title, and you're using CJK's AppleScript incorrectly. You can't just copy and paste it, because the variable input_string doesn't exist in your workflow. You have to read the contents of the text file you create and pass that to the script.


I tried to fix it myself, but I couldn't because I'm only able to use programming languages that aren't completely stupid while this requires AppleScript. Perhaps CJK can fix it.

 

21 hours ago, jmm28260 said:

At the following link you will find the messed up Automator workflow. If you have a minute, can you help?

 

Well, that took much more than just a minute.  Partly because your workflow was just kinda ugh, and partly because Objective-C is really bloody annoying on occasion, and the AppleScript needed to be re-written to cope with multiple file inputs, and because Automator can't do repeat loops by itself.


Here's a screenshot of the Automator workflow, which now has only four actions:

[Screenshot: the four-action Automator workflow]

 

The modified AppleScript for use in the Run AppleScript action is below. The screenshot and script are largely for the benefit of anyone viewing this post at a later date, so that they can piece the workflow together themselves: I still haven't set up anywhere to host permanent links to file shares, so this download will be temporary:

 

Append Language To Name of PDF File.workflow.zip

use framework "Foundation"

property this : a reference to current application
property NSFileManager : a reference to NSFileManager of this
property NSLinguisticTagger : a reference to NSLinguisticTagger of this
property NSString : a reference to NSString of this

property nil : a reference to missing value

on run [fs, null]
	script fileURLs
		property list : fs
	end script
	
	set FileManager to NSFileManager's defaultManager()
	
	repeat with f in the list of fileURLs
		try
			set lang to "_" & ((NSLinguisticTagger's ¬
				dominantLanguageForString:(NSString's ¬
					stringWithContentsOfURL:f)) as text)
			
			set basename to (NSString's stringWithString:(f's ¬
				POSIX path))'s stringByDeletingPathExtension()
			set oldname to (basename's ¬
				stringByAppendingPathExtension:"pdf")
			set newname to ((basename's ¬
				stringByAppendingString:lang)'s ¬
				stringByAppendingPathExtension:"pdf")
			
			tell the FileManager to moveItemAtPath:oldname ¬
				toPath:newname |error|:nil -- Rename PDF file
			tell the FileManager to trashItemAtURL:f ¬
				resultingItemURL:nil |error|:nil -- Delete text file
		end try
	end repeat
end run
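
For readers who don't speak AppleScript-ObjC, the naming step in the script above can be sketched in shell. This is only an illustration of the same scheme (strip the text file's extension, append an underscore plus the language code, then add .pdf), not the workflow's actual code; the function name pdf_names and the example path are hypothetical.

```shell
# Given the path of the extracted text file and a language code, print the
# PDF's current name and its new name, one per line.
pdf_names() {
    txt_path=$1
    lang=$2
    base=${txt_path%.*}                   # drop the final extension (.txt)
    printf '%s\n' "${base}.pdf"           # the PDF sitting next to the text file
    printf '%s\n' "${base}_${lang}.pdf"   # same name with the language appended
}

# Hypothetical example:
#   pdf_names "/Users/me/OCR/report.txt" fr
```

Like the AppleScript, this assumes the PDF and its text file live in the same folder and share a base name; if the text files are saved somewhere else, the derived PDF path will not exist and nothing gets renamed.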


Wow! You've done a terrific job! Thanks so much; I really appreciate your help.

I've run the workflow and it completes, but unfortunately the name comes out unchanged, without the language specification. I'm sure you've checked it, so is there something I missed?


Ah crap. I was hoping to learn how to read files in AppleScript (which was the bit that had me tearing my hair out), but you used ObjC instead.

 

Still, that’s also a perfectly fine solution.

19 hours ago, jmm28260 said:

it works fine, but unfortunately, the name comes out unchanged, without the language specification. I'm sure you've checked it, so is there something I missed?

 

As a tip to help get your problems solved faster, it's typically not very useful when you state that something doesn't work, then ask if I know what's wrong.  I need a lot more information than that to be able to diagnose the potential issues.


The dialog, whilst a sensible thing to screenshot and share, is sadly not especially helpful in this instance, but that's Automator's fault, not yours.  It gives the vaguest errors, by informing you that there's a problem, and then assuming we all love needles in haystacks.


What you should do first is edit the AppleScript and remove the line that says try, and remove the line that says end try.  Then don't forget to press the hammer icon button, which recompiles the AppleScript code (do this whenever you make an edit to the code and before you run the workflow again).

 

Then, here is what I need to know:

  • Which version of macOS are you running?
  • What happens inside Automator itself when you run the workflow?
    • Which actions complete successfully, with green ticks by them?
    • Which actions fail to complete, and in particular, which one fails first?
    • Of the actions that complete with green ticks, do any of them show an empty output in the results section?
  • Did you set the directory variable to an appropriate value?
  • Did you change the location where the text files generated from the PDFs get saved?


My macOS version is 10.14.2.

I removed the lines with try and end try and followed all your instructions.

All actions produce results, but all of them contain only the .txt file. No PDF file appears in any action's results except the first one, and when the workflow completes I don't get any PDF file in the directory with the language in its name. The last result is the .txt file, without any mention of the language.

I did not change anything, except for the directory and they keep the same location.

Hope this helps.

 

 

Capture d’écran 2019-01-05 à 11.24.55.png

Capture d’écran 2019-01-05 à 11.25.50.png


@jmm28260 OK, thanks for this. Would you mind uploading your workflow as it is, so I can take a look at it on my system? Otherwise we could spend days going back and forth here. I think that'll be easiest for us both.

On 1/5/2019 at 10:39 AM, jmm28260 said:

I did not change anything, except for the directory and they keep the same location.

 

So that was a lie:

 

[Screenshot: your workflow]

 

In the first action, you have the search folder set to 1. OCR.  In the second action, you changed the output folder to Desktop.


Now compare it with the workflow that I sent to you:

 

[Screenshot: the workflow I sent you]

 

Both the search directory and the output directory are the same, and I created a dedicated variable for it that holds the path.  It's called directory and it's stored with the Automator variables that are accessible at the bottom of the window by clicking one of the buttons down there.  Double-click the directory variable, and you can set the path through this, which will set both the search path, and the output path.


[ PS. You can actually delete the Set Variable action and fileList variable.  That variable doesn't end up being used, so it and the action that creates it are superfluous. ]

 

5 hours ago, jmm28260 said:

Sorry for that. My mistake, I sent you the wrong workflow. But the problem remains as you can see:

 

No, I can't see, because you cut off the top of the workflow, so only the output folder was visible. 


Anyway, let's try again.  Please upload your correct workflow, and I'll have another look.
