Workflow to identify language in pdf

jmm28260 · January 7, 2019

Here we go: https://www.dropbox.com/sh/6grp8odam5y076i/AABogVAk7WmSUUapIfu02Tv3a?dl=0

CJK · January 7, 2019

It's still showing that you've set the input folder to one thing, and the output folder to another.
I can even see that you edited the path of the directory variable to point to a second folder called "2. Renommer".

When I assigned the variable directory back to both the input and the output folders, then changed the path of directory to point to my test folder where I copied in a bunch of PDFs, the workflow ran as expected.

My suggestion is that you first try and use the workflow exactly as I've laid out, and then you can play around with it a bit more when you are more familiar with what does what, etc.

If you still have problems, you need to be very specific, and pedantic, about where the very, very, first unexpected occurrence takes place. You're using Mojave (I'm on High Sierra), so it's possible your Automator has more bugs in it (it sounds to me from the things I've heard that Mojave has been a nightmare for AppleScript and Automator). So, look at each action individually - for example, the first action is a Finder action, and they are notoriously buggy in some situations. Therefore, make sure it is actually finding your PDF files successfully (all of them).
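
One way to check that first step in isolation is to run the Finder query on its own in Script Editor and compare the count with what is visible in the folder. This is only a minimal sketch: the variable names are made up for illustration, and the filter is the same kind of Finder name-extension test that appears in the script later in this thread.

```applescript
-- Hypothetical stand-alone check, not part of the workflow itself
set pdfFolder to choose folder with prompt "Select the folder containing the PDFs"

tell application "Finder"
    -- Collect every file in the folder whose extension is "pdf"
    -- (AppleScript text comparison ignores case by default)
    set pdfFiles to (every file of pdfFolder whose name extension is "pdf") as alias list
end tell

-- The result pane shows how many PDF files Finder actually found
count of pdfFiles
```

If the number reported is lower than the number of PDFs in the folder, the problem lies in the file-gathering step rather than in the language detection.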

The script portion of the workflow appears to be working fine, even on your end. I know it's not producing the output you want yet, but there's no error being thrown and it completes its run. To me, it suggests that the problem is in the actions preceding it that are not set correctly, and not producing the output necessary.

Worst comes to the worst, I can do a re-write of the whole thing. What I would do is replace the Finder action with another AppleScript action instead, and get that to retrieve the PDF files reliably. Then I'd recode the main AppleScript action to use less Objective-C and more AppleScript, as Objective-C can be quite strict with security protocols, and it's possible (but unlikely) that it's not permitting Automator to make changes to your filesystem.

But you do your bit first. Then give me a couple of days and I'll see what's going on.

jmm28260 · January 7, 2019

Indeed, Mojave might be the source of the problem; since I updated my OS, several of my Automator workflows no longer work.

But in this case, the workflow runs fine; all the actions are performed and green-checked. The only issue I see is that the AppleScript receives a .txt file and returns that same file, rather than a .pdf file with the language code appended to its name. My guess is that Mojave does not handle the AppleScript as it should.

I might have to wait for Apple to correct that problem with its future updates.

However, I really want to thank you for all the time you have devoted to my problem.

jmm28260 · January 8, 2019

Problems between Mojave and AppleScript are reported all over the net.

I tried to find some explanations on the web and here are 2 possibilities that might help:

I also found a possible answer that was suggested on a forum and that might work:

This worked for me, but it requires converting to an app first!

1. Open the app's plist file in Xcode
2. Add a row (from the right-click context menu)
3. In the ‘Key’ column, select ‘Privacy - AppleEvents Sending Usage Description’ from the drop-down menu (you need to scroll down)
4. Add ‘This script needs to control other applications to run.’ in the value column.
5. Build the application again... it should now prompt for accessibility and automation permissions.

Hope it helps...

CJK · January 18, 2019

@jmm28260 Sorry for the delay—been unwell. (Also, it's a good idea to tag the person to whom you're replying in your post using the @ symbol followed by their username, otherwise they won't necessarily be notified by your reply; or, at least, I'm not, despite having the option selected).

On 1/8/2019 at 5:57 PM, jmm28260 said:

I tried to find some explanations on the web and here are 2 possibilities that might help:

On 1/8/2019 at 5:57 PM, jmm28260 said:

I also found a possible answer that was suggested on a forum and that might work:

OK, well done for finding those bits and pieces. I'll leave those for you to attempt to implement if you really feel any of them are the problem. I don't have Mojave and I don't intend to, so those recommendations aren't really something I can play around with.

In the meantime, I rewrote the Automator workflow from scratch. The workflow itself has been dramatically simplified because the AppleScript portion has taken over all of the functionality. I did this for a couple of reasons: 1) I wanted to reduce the number of workflow actions in which problems might potentially arise. There are now only 3 actions, and none of them needs to be touched or edited, because there are no longer any input or output folder selections to make, which had been a source of contention for me. The only part of the workflow that requires customising is the path stored in the directory variable, and this can only be edited in the bottom pane of Automator, where variables are stored.

2) Being now entirely script based, it means one can also just copy and paste the AppleScript into Script Editor and run it from in there. This will be a useful thing to do particularly if the Automator workflow fails on you. Script Editor will be a lot more helpful with its error messages that will make pinpointing any scripting issues a bit easier. Also, if it works in one environment, but not the other, then that's an entirely different problem that I doubt I'll be able to help with.

Using this new method, no new files are created at any point, so no text files will appear in the PDF folder or on the desktop. The only effect the script will (should) have, when run from either environment, is to rename the PDF files by appending the language code to the end of the name, and then reveal those files in Finder. The script doesn't return any value, so don't worry if the workflow results appear empty at the end.

Here's the AppleScript:

use framework "Foundation"
use framework "Quartz"
use scripting additions

property this : a reference to current application
property NSArray : a reference to NSArray of this
property NSLinguisticTagger : a reference to NSLinguisticTagger of this
property NSMutableDictionary : a reference to NSMutableDictionary of this
property NSString : a reference to NSString of this
property PDFDocument : a reference to PDFDocument of this

property samples : 4 -- The (maximum) number of pages to sample for text

on run filepaths
    set [filepaths] to filepaths & {null}
    if class of filepaths = script or filepaths = {} then set ¬
        filepaths to [(choose file of type ["com.adobe.pdf"] ¬
        with multiple selections allowed), null]
    set [filepaths, null] to filepaths

    if class of filepaths ≠ list then
        set directory to POSIX file (POSIX path of filepaths) as alias
        tell application "Finder" to set filepaths to (every file ¬
            in the directory whose name extension = "PDF") ¬
            as alias list
    end if

    set PDFs to {}

    repeat with PDFPath in filepaths
        set lang to probableLanguageForPDF at PDFPath
        set end of PDFs to my (stick on "_" & lang to PDFPath)
    end repeat

    tell application "Finder"
        reveal the PDFs
        activate
    end tell
end run

# stick
# Appends a suffix to the filename (without extension) of the file at the
# specified path, without altering the file extension
to stick on suffix to fp as text
    local fp, suffix

    set filename to null

    tell (NSString's stringWithString:(fp's POSIX path)) to if ¬
        false = ((the lastPathComponent()'s ¬
        stringByDeletingPathExtension()'s hasSuffix:suffix)) ¬
        as boolean then set filename to ¬
        (((the lastPathComponent()'s ¬
        stringByDeletingPathExtension()'s ¬
        stringByAppendingString:suffix))'s ¬
        stringByAppendingPathExtension:(the ¬
        pathExtension())) as text

    tell application "System Events" to tell the item named fp
        if filename = null then return it as alias
        set dir to its container
        set its name to filename
        return the item named filename in dir as alias
    end tell
end stick

# probableLanguageForPDF
# Obtain the most likely language of a PDF file based on sampling a small
# number of its pages and returning the most commonly detected language code
on probableLanguageForPDF at PDFPath as text
    local PDFPath

    set PDFFileURL to POSIX file (PDFPath's POSIX path) as alias
    set PDF to PDFDocument's alloc()'s initWithURL:PDFFileURL
    set PDFTitle to the PDF's documentAttributes()'s |Title| as text
    set N to the PDF's pageCount() as integer

    -- Ignore first and last page unless they are the only pages
    set PDFPageNumbers to array(0, N - 1)
    set a to item 2 of (PDFPageNumbers & {0})
    set b to item -2 of ({N - 1} & PDFPageNumbers)
    if a > b then set [a, b] to [b, a]

    set langs to NSMutableDictionary's dictionary()

    -- The language of the PDF's title
    set PDFTitleLang to (NSLinguisticTagger's ¬
        dominantLanguageForString:PDFTitle)
    langs's setValue:1 forKey:PDFTitleLang

    -- Select only a small sample of pages
    -- to obtain a language for each
    repeat with i from a to b by N div samples + 1
        set PDFPage to the (PDF's pageAtIndex:i)
        set PDFPageText to PDFPage's |string|()
        set PDFPageLang to (NSLinguisticTagger's ¬
            dominantLanguageForString:PDFPageText)

        set [x] to references in {langs's valueForKey:PDFPageLang} & {0}
        (langs's setValue:((x as integer) + 1) forKey:PDFPageLang)
    end repeat

    -- The most common language identified
    set lang to the last item of (langs's ¬
        keysSortedByValueUsingSelector:"compare:")

    lang as text
end probableLanguageForPDF

# array()
# Generate a list of consecutive (ascending) integers between +a and +b
on array(a as integer, b as integer)
    local a, b

    if a > b then set [a, b] to [b, a]

    script |integers|
        property list : {}
    end script

    repeat with i from a to b
        set the end of the list of |integers| to i
    end repeat

    return the list of |integers|
end array

Finally, here's the Automator workflow file. Any future readers for whom this link is broken should have no problem recreating it themselves by copying and pasting the script above. It will work as a single AppleScript action inside Automator, or you can copy the action workflow depicted in the image above, which simply sets a directory.

https://transfer.sh/6CGwf/Append Language To Name of PDF File.workflow.zip

@jmm28260, report how things turn out with this iteration, either from Automator, or from Script Editor, or from both. If Automator fails, report how and why, but then definitely run from within Script Editor to confirm the error is script-generated rather than app-generated, and to get a more focused error report.

Edited January 22, 2019 by CJK
Important edits made to AppleScript code; updated the download link for the Automator workflow

dfay · January 19, 2019

@CJK thanks for your work on this! Looking forward to having a bit of time to work through how you did the Applescript / ObjC integration.

CJK · January 20, 2019

On 1/19/2019 at 12:18 AM, dfay said:

@CJK thanks for your work on this! Looking forward to having a bit of time to work through how you did the Applescript / ObjC integration.

Great. Always happy to hear thoughts/comments/etc.

jmm28260 · January 20, 2019

@CJK Thanks so much for your efforts. I am impressed by your work and thank you for your time.

I have tried running your AppleScript in Script Editor on Mojave, and here is the error message I get:

And I have used the Automator workflow you provided, following your instructions precisely; I specified a definite directory to look into, where I had put a PDF file, and got the following message:

The action “Run AppleScript” encountered an error: “*** -[__NSDictionaryM setObject:forKey:]: key cannot be nil”

Looks like Mojave is reluctant to accept some instructions.

CJK · January 22, 2019

On 1/20/2019 at 12:27 PM, jmm28260 said:

@CJK Thanks so much for your efforts. I am impressed by your work and thank you for your time.

That's ok. You didn't tag me correctly. The correctly-tagged name will appear in purple. When you type @CJK, you should obtain a list of users that contain these letters in their username, and you can click on the appropriate one.

On 1/20/2019 at 12:27 PM, jmm28260 said:

I have tried running your AppleScript in Script Editor on Mojave, and here is the error message I get:

Erm... Yeah, that's because you double-spaced the whole script. You've got a blank line in between every non-blank line. I have no idea how you accomplished that. I did a copy-n-paste test from my previous post and it pasted correctly.

On 1/20/2019 at 12:27 PM, jmm28260 said:

The action “Run AppleScript” encountered an error: “*** -[__NSDictionaryM setObject:forKey:]: key cannot be nil”

This is really helpful actually, because it finally pinpoints the nature of the error and where it's coming from. This is being generated by the sub-routine (handler) called filterContents. It uses an Objective-C data class called NSFileManager to access and manipulate filesystem objects. Basically, NSFileManager needs certain permissions/authorisation to be allowed to access your filesystem, and your system hasn't granted those access rights. I could play around with getting an authorisation request from the script side of things; or you could play around with physically granting those privileges on a permanent basis. Neither of these things are things I know how to do off the top of my head, so it's actually easier to get rid of that handler and get Finder to do the job that NSFileManager was doing.

To summarise the changes I am making to the script:

① This handler is now defunct and will be deleted completely:

On 1/18/2019 at 10:02 PM, CJK said:

to filterContents at directory by extension as list
    local directory, extension

    set directory to directory's POSIX path

    script dir
        property filenames : ((NSFileManager's defaultManager()'s ¬
            contentsOfDirectoryAtPath:directory ¬
            |error|:(missing value))'s ¬
            pathsMatchingExtensions:extension) as list
        property filepaths : {}
    end script

    repeat with fp in dir's filenames
        try
            set end of filepaths in dir to ¬
                POSIX file (((NSString's ¬
                stringWithString:directory)'s ¬
                stringByAppendingPathComponent:fp) ¬
                as text) as alias
        end try
    end repeat

    dir's filepaths
end filterContents

② This is where the line of code crops up that makes use of the now-defunct handler. I will be changing this block:

On 1/18/2019 at 10:02 PM, CJK said:

if class of filepaths ≠ list then
    set directory to POSIX file (POSIX path of filepaths) as alias
    set filepaths to filterContents at directory by ["pdf", "PDF"]
end if

to this:

if class of filepaths ≠ list then
    set directory to POSIX file (POSIX path of filepaths) as alias
    tell application "Finder" to set filepaths to (every file ¬
        in the directory whose name extension = "PDF") ¬
        as alias list
end if

This will most likely fix the error. It also clarifies why the first workflow wasn't working for you: that one relied on NSFileManager for all of its filesystem calls. Following these edits, the script and workflow will rely completely on Finder and System Events for their filesystem calls.

To avoid getting these posts even more bogged down with long script texts, I'll make the changes to the script in the previous post, so you can copy and paste it from there. Alternatively, since you had problems copying-and-pasting that code, you can download a copy of the AppleScript file from here:

https://github.com/ChristoferK/AppleScriptive/blob/master/scripts/Append PDF Language To Filename.applescript

The workflow with the updated script can be obtained here: https://transfer.sh/6CGwf/Append Language To Name of PDF File.workflow.zip

I will update the link from the previous post too.

jmm28260 · January 22, 2019

@CJK

Thanks for your answer. Here is the error message I get when I run the Applescript on Mojave.

CJK · January 22, 2019

*Sigh*

My guess here is that the property names of the PDF's attributes are specific to one's locale, so the metadata item that holds your document's title is likely not called Title, but possibly Titre. To confirm my hunch, you could change this line:

set PDFTitle to the PDF's documentAttributes()'s |Title| as text

to this:

error PDF's documentAttributes() as record

which will show you what the attribute names and values are in the error dialog that pops up. It's also possible that the specific metadata item for the document title wasn't filled in for the particular document you're testing the script on, as they are optional.

In both of these scenarios, the fix for it should be to change that line I just highlighted above to this:

set PDFTitle to PDF's documentAttributes()'s ¬
    objectForKey:(PDFDocumentTitleAttribute of this)

Using the PDFDocumentTitleAttribute constant should return a value appropriate for your system; and using objectForKey to access the property as an NSDictionary rather than as a property that is accessed through a method will handle properties that don't exist a lot more gracefully by returning missing value.

I've already made the change and updated the script file on GitHub.

Edited January 22, 2019 by CJK

jmm28260 · January 22, 2019

@CJK Hi again,

I do understand your sighs ;)) but unfortunately the game does not seem to be over ... Here is a new error message, after fixing the previous one as indicated.

I always test with different PDFs to make sure the error does not come from the file.

CJK · January 22, 2019

Change these lines:

set PDFTitleLang to (NSLinguisticTagger's ¬
    dominantLanguageForString:PDFTitle)
langs's setValue:1 forKey:PDFTitleLang

to this:

tell PDFTitle to if missing value ≠ it then tell ¬
    (NSLinguisticTagger's dominantLanguageForString:it) to if ¬
    missing value ≠ it then langs's setValue:1 forKey:it

File updated on GitHub.

Edited January 22, 2019 by CJK

jmm28260 · January 22, 2019

@CJK Hi again,

New message

CJK · January 22, 2019

Why did you post the same problem from earlier? That one has been dealt with. You've posted a screenshot that shows you're using an older version of the script.

jmm28260 · January 22, 2019

@CJK Sorry, my mistake: I sent the wrong snapshot...

CJK · January 22, 2019

Would you mind doing me a favour and just sending me the PDFs that you are using for testing purposes? That way, I can see what sorts of content you're dealing with.

This latest error is very similar in nature to the one affecting the Title of the PDF document, but it seems odd that it would arise when dealing with a page from the PDF. It implies that no language was detected, which in turn implies the page had no text. I can believe this might be the case for one random PDF file, but if you're testing different files, then the others ought to work.

jmm28260 · January 22, 2019

@CJK Thanks a lot. I changed file, and it seems to be working now. I really admire your competence and your tenacity!

One last thing that is strange: when I feed the script Italian text, I get .en for some files and, rightly, .it for others. Apparently its detection of Italian is erratic ... I guess that has nothing to do with the script. Again, thanks a lot and great job!!

CJK · January 22, 2019

1 hour ago, jmm28260 said:

I changed file, and it seems to be working now

That's good to hear. However, it was still an error that occurred in one file, so does require fixing, as there may be other PDFs that have blank pages. I've updated the file on GitHub with a fix for that, and a couple of other minor adjustments that I also realised were potential weak spots in those sorts of one-off situations.
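
The fix itself lives in the GitHub copy and isn't reproduced in this thread. As a rough sketch of the kind of guard that avoids the nil key, assuming the only problem is that dominantLanguageForString: returns missing value for text it cannot classify, something along these lines would do. languageOrMissing is a hypothetical helper name, not a handler from the script.

```applescript
use framework "Foundation"
use scripting additions

property NSLinguisticTagger : a reference to NSLinguisticTagger of current application

-- Hypothetical helper: returns a language code as text, or missing value
-- when there is no text or no language could be determined
on languageOrMissing(sampleText)
    if sampleText is missing value then return missing value
    set lang to NSLinguisticTagger's dominantLanguageForString:sampleText
    if lang is missing value then return missing value
    return lang as text
end languageOrMissing

-- A page with no text gives the tagger nothing to work with, so the caller
-- should skip it rather than use the result as a dictionary key
set pageLang to languageOrMissing("")
if pageLang is missing value then
    log "No language detected; skipping this page"
else
    log pageLang
end if
```

The point of the design is simply that nothing is ever passed to setValue:forKey: unless a language code actually came back.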

2 hours ago, jmm28260 said:

One last thing that is strange: when I feed the script Italian text, I get .en for some files and, rightly, .it for others. Apparently its detection of Italian is erratic

No idea, I'm afraid. That one will be down to the language tagging function, which isn't something I'm able to change. Mojave does have a newer language detection class called NLLanguageRecognizer, but I'm not sure how it differs from the one my script uses (NSLinguisticTagger). I'm running High Sierra, so I don't have access to the newer one in order to try it out.

Without knowing the content of the PDFs that you were expecting to be tagged as being Italian, I can't judge any probable things to investigate. But, you could always try increasing the number of pages that are sampled in each PDF. There's a line in the first section of the script where all the properties are declared:

property samples : 4

If you increase this number, the script will use more pages to identify the language.
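
To see what increasing the number does in practice, here is a minimal plain-AppleScript sketch of the page-selection arithmetic. It assumes the loop works the way it reads in the script posted earlier (skip the first and last page when there are at least three, then step by N div samples + 1). sampledPageIndices is a hypothetical handler name used only for illustration.

```applescript
-- Hypothetical handler: lists the zero-based page indices that would be sampled
on sampledPageIndices(pageCount, samples)
    -- Skip the first and last page unless there are fewer than three pages
    if pageCount < 3 then
        set {firstIndex, lastIndex} to {0, pageCount - 1}
    else
        set {firstIndex, lastIndex} to {1, pageCount - 2}
    end if

    set indices to {}
    -- Integer division decides how far apart the sampled pages are
    repeat with i from firstIndex to lastIndex by (pageCount div samples) + 1
        set end of indices to i
    end repeat
    return indices
end sampledPageIndices

sampledPageIndices(20, 4) --> {1, 7, 13}
sampledPageIndices(20, 8) --> {1, 4, 7, 10, 13, 16}
```

Because the step comes from integer division, the number of pages actually sampled can be lower than the value of samples, as in both examples here.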

Just in case you're interested, the current method of decision making for the language is done by detecting the language on each sampled page. If different pages produce different results, then the script chooses the language that was tagged in the most number of pages from the sample. So if your PDF had a mixture of Italian and English in it, and the sample pages happened to feature more densely occurring English than Italian, it would tag the file overall as English.
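
The "most pages wins" rule can be sketched in plain AppleScript as follows. This is not the code from the script, which keeps its counts in an NSMutableDictionary; the handler name and the sample list are made up for illustration, and in this version ties go to whichever language was seen first.

```applescript
-- Hypothetical handler: returns the language code that occurs most often
on mostCommonLanguage(langCodes)
    set tally to {} -- list of {code, occurrences} pairs

    repeat with codeRef in langCodes
        set thisCode to contents of codeRef
        set wasFound to false
        repeat with tallyEntry in tally
            if (item 1 of tallyEntry) is thisCode then
                -- Seen before: bump its counter
                set item 2 of tallyEntry to (item 2 of tallyEntry) + 1
                set wasFound to true
                exit repeat
            end if
        end repeat
        -- First sighting: start a new counter at 1
        if not wasFound then set end of tally to {thisCode, 1}
    end repeat

    set bestCode to missing value
    set bestTotal to 0
    repeat with tallyEntry in tally
        if (item 2 of tallyEntry) > bestTotal then
            set bestCode to item 1 of tallyEntry
            set bestTotal to item 2 of tallyEntry
        end if
    end repeat
    return bestCode
end mostCommonLanguage

-- Made-up results for five sampled pages of a mixed Italian/English PDF
mostCommonLanguage({"it", "en", "it", "en", "it"}) --> "it"
```

With only a handful of sampled pages, a single extra English-heavy page is enough to flip the outcome, which is one plausible reason an Italian document could end up tagged as English.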

The other way I contemplated doing it was combining the sample pages together into one long piece of text, and detecting the language of that as if it were a single page. That would always produce one definite result. Then, of course, the last possibility is to remove the sampling altogether and just assess the entire PDF file in one go. However, I avoided that because some of my PDFs are quite large, and if the script had to process many PDFs, each with hundreds of pages, that's obviously going to slow it down compared to just sampling 4 pages from each PDF, without necessarily producing better results.
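
A minimal sketch of the combined-text alternative is shown below, under the assumption that the page texts have already been extracted; the three strings are invented stand-ins, and the variable names are illustrative only. It relies on the same NSLinguisticTagger call the script already uses, so it needs High Sierra or later.

```applescript
use framework "Foundation"
use scripting additions

property NSLinguisticTagger : a reference to NSLinguisticTagger of current application

-- Hypothetical page texts standing in for the strings taken from sampled pages
set pageTexts to {"Questa è la prima pagina del documento.", "This page happens to be in English.", "Questa pagina è di nuovo in italiano."}

-- Join the samples into one block of text, restoring the delimiters afterwards
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set combinedText to pageTexts as text
set AppleScript's text item delimiters to savedDelimiters

-- One detection pass over the combined text gives a single answer
set lang to NSLinguisticTagger's dominantLanguageForString:combinedText
if lang is missing value then return missing value
return lang as text
```

The trade-off is that a few long pages in one language can outweigh many short pages in another, whereas the per-page vote treats every sampled page equally.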

Anyway, if you have a particular preference over which method should be used, just let me know and I can change the implementation quite easily.

jmm28260 · January 23, 2019

@CJK Thanks so much for this terrific job. I really appreciate your time and efforts.

After getting your last version, I ran it and I think something was deleted in your procedure, since it produced no file at all. I have included it also in my Automator workflow, and all the actions preceding the Applescript got results, but nothing came out of the script. No results shown.

https://www.dropbox.com/sh/q8rp4cs8lmi7zb4/AADpmmhTSHL4wqNLldH6fuSla?dl=0

Would you mind checking it?
