
Workflow to identify language in pdf



It's still showing that you've set the input folder to one thing, and the output folder to another.
I can even see you edited the path of the directory variable to lead to a second folder called "2. Renommer".


When I assigned the variable directory back to both the input and the output folders, then changed the path of directory to point to my test folder where I copied in a bunch of PDFs, the workflow ran as expected.


My suggestion is that you first try and use the workflow exactly as I've laid out, and then you can play around with it a bit more when you are more familiar with what does what, etc.


If you still have problems, you need to be very specific, and pedantic, about where the very, very, first unexpected occurrence takes place.  You're using Mojave (I'm on High Sierra), so it's possible your Automator has more bugs in it (it sounds to me from the things I've heard that Mojave has been a nightmare for AppleScript and Automator).  So, look at each action individually - for example, the first action is a Finder action, and they are notoriously buggy in some situations.  Therefore, make sure it is actually finding your PDF files successfully (all of them).
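
To check that step on its own, outside Automator, a few lines in Script Editor will do. This is only a rough sketch, not part of the workflow, and the folder path is a made-up placeholder that needs swapping for your own test folder:

```applescript
-- Hypothetical folder path: replace with the folder that holds your PDFs
set testFolder to POSIX file "/Users/yourname/Desktop/PDF Test" as alias

tell application "Finder"
	-- Collect the names of every PDF that Finder can see in that folder
	set pdfNames to name of every file of testFolder ¬
		whose name extension is "pdf"
end tell

-- Show the count and the names in Script Editor's result pane
return {count of pdfNames, pdfNames}
```

If the count is lower than the number of PDFs you can see in the folder, the problem lies upstream of the script.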


The script portion of the workflow appears to be working fine, even on your end.  I know it's not producing the output you want yet, but there's no error being thrown and it completes its run.  To me, it suggests that the problem is in the actions preceding it that are not set correctly, and not producing the output necessary.


Worst comes to the worst, I can do a re-write of the whole thing.  What I would do is replace the Finder action with another AppleScript action instead, and get that to retrieve the PDF files reliably.  Then I'd recode the main AppleScript action to use less Objective-C and more AppleScript, as Objective-C can be quite strict with security protocols, and it's possible (but unlikely) that it's not permitting Automator to make changes to your filesystem.


But you do your bit first.  Then give me a couple of days and I'll see what's going on.

Link to comment

Indeed, Mojave might be the source of the problem; since I updated my OS, several of my Automator workflows no longer work.

But in this case, the workflow runs fine; all the actions are performed and green-checked. The only issue I can see is that the AppleScript receives a .txt file and returns that same file, rather than a .pdf file with the language code appended to its name. My guess is that Mojave does not accept and process the AppleScript as it should.

I might have to wait for Apple to correct that problem with its future updates.

However, I really want to thank you for all the time you have devoted to my problem.

Link to comment

Problems between Mojave and AppleScript are reported all over the net.

I tried to find some explanations on the web and here are 2 possibilities that might help:

I also found a possible answer that was suggested on a forum and that might work:

 

This worked for me, but it requires converting to an app first!

 

  • 1. Open the app's plist file in Xcode.
  • 2. Add a row (from the right-click context menu).
  • 3. In the ‘Key’ column, select ‘Privacy - AppleEvents Sending Usage Description’ from the drop-down menu (you need to scroll down).
  • 4. Add ‘This script needs to control other applications to run.’ in the ‘Value’ column.
  • 5. Build the application again. It should now prompt for accessibility and automation permissions.

Hope it helps...

Link to comment
  • 2 weeks later...

@jmm28260 Sorry for the delay—been unwell.  (Also, it's a good idea to tag the person to whom you're replying in your post using the @ symbol followed by their username, otherwise they won't necessarily be notified by your reply; or, at least, I'm not, despite having the option selected).

 

On 1/8/2019 at 5:57 PM, jmm28260 said:

I tried to find some explanations on the web and here are 2 possibilities that might help:

 

On 1/8/2019 at 5:57 PM, jmm28260 said:

I also found a possible answer that was suggested on a forum and that might work:

 

OK, well done for finding those bits and pieces.  I'll leave those for you to attempt to implement if you really feel any of them are the problem.  I don't have Mojave and I don't intend to, so those recommendations aren't really something I can play around with.


In the meantime, I rewrote the Automator workflow from scratch.  The workflow itself has been dramatically simplified because the AppleScript portion has taken over all of the functionality.  I did this for a couple of reasons: 1) I wanted to reduce the number of workflow actions in which problems might potentially arise.  There are now only 3 actions, and none of them needs to be touched or edited, because there are no longer any directory selections to make for the input and output folders, which were a source of contention for me.  The only part of the workflow that requires customising is the path stored in the directory variable, and this can only be edited in the bottom pane where variables are stored in Automator:

 

[Screenshot: the Automator workflow, with the directory variable shown in the variables pane]

 

2) Being now entirely script based, it means one can also just copy and paste the AppleScript into Script Editor and run it from in there.  This will be a useful thing to do particularly if the Automator workflow fails on you.  Script Editor will be a lot more helpful with its error messages that will make pinpointing any scripting issues a bit easier.  Also, if it works in one environment, but not the other, then that's an entirely different problem that I doubt I'll be able to help with.


Using this new method, there are no new files created at any point, so no text files will appear in the PDF folder or on the desktop.  The only effect the script will (should) have, when run from either environment, is to rename the PDF files by appending the language code to the end of the name, then reveal those files in Finder.  The script doesn't return any value, so don't worry if the workflow results appear empty at the end.
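
To illustrate what appending the language code to the end of the name means, here is a plain AppleScript sketch of the naming step alone. The handler name nameWithSuffix and the sample file name are made up for illustration; the script itself does the equivalent through NSString methods in its stick handler and then renames the file via System Events.

```applescript
-- Illustrative handler (not from the workflow): build the new file name
on nameWithSuffix(fileName, suffix)
	set oldDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "."
	set parts to text items of fileName
	if (count of parts) < 2 then
		-- No extension present: the whole name is the stem
		set {stem, ext} to {fileName, ""}
	else
		-- Everything before the last dot is the stem
		set stem to (items 1 thru -2 of parts) as text
		set ext to "." & item -1 of parts
	end if
	set AppleScript's text item delimiters to oldDelimiters
	
	-- Leave the name alone if it already carries the suffix
	if stem ends with suffix then return fileName
	return stem & suffix & ext
end nameWithSuffix

nameWithSuffix("Rapport annuel.pdf", "_fr") --> "Rapport annuel_fr.pdf"
```

Running the handler a second time on the renamed file returns the name unchanged, which is why repeated runs should not stack suffixes.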


Here's the AppleScript:

use framework "Foundation"
use framework "Quartz"
use scripting additions

property this : a reference to current application

property NSArray : a reference to NSArray of this
property NSLinguisticTagger : a reference to NSLinguisticTagger of this
property NSMutableDictionary : a reference to NSMutableDictionary of this
property NSString : a reference to NSString of this
property PDFDocument : a reference to PDFDocument of this

property samples : 4 -- The (maximum) number of pages to sample for text


on run filepaths
        set [filepaths] to filepaths & {null}

        if class of filepaths = script or filepaths = {} then set ¬
                filepaths to [(choose file of type ["com.adobe.pdf"] ¬
                with multiple selections allowed), null]
        set [filepaths, null] to filepaths

        if class of filepaths ≠ list then
                set directory to POSIX file (POSIX path of filepaths) as alias
                tell application "Finder" to set filepaths to (every file ¬
                        in the directory whose name extension = "PDF") ¬
                        as alias list
        end if

        set PDFs to {}
        repeat with PDFPath in filepaths
                set lang to probableLanguageForPDF at PDFPath
                set end of PDFs to my (stick on "_" & lang to PDFPath)
        end repeat

        tell application "Finder"
                reveal the PDFs
                activate
        end tell
end run

# stick
#   Appends a suffix to the filename (without extension) of the file at the
#   specified path, without altering the file extension
to stick on suffix to fp as text
        local fp, suffix

        set filename to null

        tell (NSString's stringWithString:(fp's POSIX path)) to if ¬
                false = ((the lastPathComponent()'s ¬
                stringByDeletingPathExtension()'s hasSuffix:suffix)) ¬
                as boolean then set filename to ¬
                (((the lastPathComponent()'s ¬
                        stringByDeletingPathExtension()'s ¬
                        stringByAppendingString:suffix))'s ¬
                        stringByAppendingPathExtension:(the ¬
                                pathExtension())) as text

        tell application "System Events" to tell the item named fp
                if filename = null then return it as alias
                set dir to its container
                set its name to filename
                return the item named filename in dir as alias
        end tell
end stick

# probableLanguageForPDF
#   Obtain the most likely language of a PDF file based on sampling a small
#   number of its pages and returning the most commonly detected language code
on probableLanguageForPDF at PDFPath as text
        local PDFPath

        set PDFFileURL to POSIX file (PDFPath's POSIX path) as alias

        set PDF to PDFDocument's alloc()'s initWithURL:PDFFileURL
        set PDFTitle to the PDF's documentAttributes()'s |Title| as text
        set N to the PDF's pageCount() as integer

        -- Ignore first and last page unless they are the only pages
        set PDFPageNumbers to array(0, N - 1)
        set a to item 2 of (PDFPageNumbers & {0})
        set b to item -2 of ({N - 1} & PDFPageNumbers)
        if a > b then set [a, b] to [b, a]

        set langs to NSMutableDictionary's dictionary()

        -- The language of the PDF's title
        set PDFTitleLang to (NSLinguisticTagger's ¬
                dominantLanguageForString:PDFTitle)
        langs's setValue:1 forKey:PDFTitleLang

        -- Select only a small sample of pages
        -- to obtain a language for each
        repeat with i from a to b by N div samples + 1
                set PDFPage to the (PDF's pageAtIndex:i)
                set PDFPageText to PDFPage's |string|()
                set PDFPageLang to (NSLinguisticTagger's ¬
                        dominantLanguageForString:PDFPageText)
                set [x] to references in {langs's valueForKey:PDFPageLang} & {0}
                (langs's setValue:((x as integer) + 1) forKey:PDFPageLang)
        end repeat

        -- The most common language identified
        set lang to the last item of (langs's ¬
                keysSortedByValueUsingSelector:"compare:")
        lang as text
end probableLanguageForPDF

# array()
#   Generate a list of consecutive (ascending) integers between +a and +b
on array(a as integer, b as integer)
        local a, b

        if a > b then set [a, b] to [b, a]

        script |integers|
                property list : {}
        end script

        repeat with i from a to b
                set the end of the list of |integers| to i
        end repeat

        return the list of |integers|
end array

 

Finally, here's the Automator workflow file.  Future readers for whom this link is broken should have no problem recreating it themselves by copying and pasting the script above.  It will work as a single AppleScript action inside Automator, or you can copy the action workflow depicted in the image above, which simply sets a directory.

 

https://transfer.sh/6CGwf/Append Language To Name of PDF File.workflow.zip

 

@jmm28260, report how things turn out with this iteration, either from Automator, or from Script Editor, or from both.  If Automator fails, report how and why, but then definitely run from within Script Editor to confirm the error is script-generated rather than app-generated, and to get a more focused error report.
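
For anyone unsure how to capture the exact error text and number in Script Editor, a try block is one way. This is a generic sketch, not part of the script above: the error line merely simulates a failure, and the statements you want to test would go in its place.

```applescript
try
	-- Stand-in for the real work; this line only simulates a failure
	error "Simulated failure for demonstration" number 1000
on error errorMessage number errorNumber
	-- Build a one-line report that can be pasted into a reply
	set report to "Error " & errorNumber & ": " & errorMessage
	set the clipboard to report
	display dialog report buttons {"OK"} default button "OK"
end try
```

The report is also placed on the clipboard, so it can be pasted straight into a post.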

Edited by CJK
Important edits made to AppleScript code; updated the download link for the Automator workflow
Link to comment

@CJK Thanks so much for your efforts. I am impressed by your work and thank you for your time.

I have tried using Script Editor in Mojave with your AppleScript and here is the error message I get:

 

[Screenshot: Script Editor error message]

 

And I have used the Automator you provided following your instructions precisely, specified a definite directory to look into, where I had put a pdf file, and got the following message:

 

The “Run AppleScript” action encountered an error: “*** -[__NSDictionaryM setObject:forKey:]: key cannot be nil”

 

Looks like Mojave is reluctant to accept some instructions.

[Attached screenshot: Capture d’écran 2019-01-19 à 18.34.02 copie.png]

Link to comment
On 1/20/2019 at 12:27 PM, jmm28260 said:

@CJK Thanks so much for your efforts. I am impressed by your work and thank you for your time.

 

That's ok.  You didn't tag me correctly.  The correctly-tagged name will appear in purple.  When you type @CJK, you should obtain a list of users that contain these letters in their username, and you can click on the appropriate one.

 

On 1/20/2019 at 12:27 PM, jmm28260 said:

I have tried using Script Editor in Mojave with your AppleScript and here is the error message I get:

 

[Screenshot: Script Editor error message]

 

Erm...  Yeah, that's because you double-spaced the whole script.  You've got a blank line in between every non-blank line.  I have no idea how you accomplished that.  I did a copy-n-paste test from my previous post and it pasted correctly.

 

On 1/20/2019 at 12:27 PM, jmm28260 said:

The “Run AppleScript” action encountered an error: “*** -[__NSDictionaryM setObject:forKey:]: key cannot be nil”

 

This is really helpful actually, because it finally pinpoints the nature of the error and where it's coming from.  This is being generated by the sub-routine (handler) called filterContents.  It uses an Objective-C class called NSFileManager to access and manipulate filesystem objects.  Basically, NSFileManager needs certain permissions/authorisation to be allowed to access your filesystem, and your system hasn't granted those access rights.  I could play around with getting an authorisation request from the script side of things; or you could play around with manually granting those privileges on a permanent basis.  Neither of these is something I know how to do off the top of my head, so it's actually easier to get rid of that handler and get Finder to do the job that NSFileManager was doing.

 

To summarise the changes I am making to the script:

 

This handler is now defunct and will be deleted completely:

On 1/18/2019 at 10:02 PM, CJK said:

to filterContents at directory by extension as list
        local directory, extension

        set directory to directory's POSIX path

        script dir
                property filenames : ((NSFileManager's defaultManager()'s ¬
                        contentsOfDirectoryAtPath:directory ¬
                                |error|:(missing value))'s ¬
                        pathsMatchingExtensions:extension) as list
                property filepaths : {}
        end script

        repeat with fp in dir's filenames
                try
                        set end of filepaths in dir to ¬
                                POSIX file (((NSString's ¬
                                        stringWithString:directory)'s ¬
                                        stringByAppendingPathComponent:fp) ¬
                                        as text) as alias
                end try
        end repeat

        dir's filepaths
end filterContents

 

 This is where the line of code crops up that makes use of the now-defunct handler.  I will be changing this block:

On 1/18/2019 at 10:02 PM, CJK said:

        if class of filepaths ≠ list then
                set directory to POSIX file (POSIX path of filepaths) as alias
                set filepaths to filterContents at directory by ["pdf", "PDF"]
        end if

 

to this:

        if class of filepaths ≠ list then
                set directory to POSIX file (POSIX path of filepaths) as alias
                tell application "Finder" to set filepaths to (every file ¬
                        in the directory whose name extension = "PDF") ¬
                        as alias list
        end if

        end if

 

This will most likely fix the error.  It also explains why the first workflow wasn't working for you, because that one relied on NSFileManager to do all of its filesystem calls.  Following these edits, the script and workflow will now rely completely on Finder and System Events for their filesystem calls.

 

To avoid getting these posts even more bogged down with long script texts, I'll make the changes to the script in the previous post, so you can copy and paste it from there.  Alternatively, since you had problems copying-and-pasting that code, you can download a copy of the AppleScript file from here: 

https://github.com/ChristoferK/AppleScriptive/blob/master/scripts/Append PDF Language To Filename.applescript


The workflow with the updated script can be obtained here: https://transfer.sh/6CGwf/Append Language To Name of PDF File.workflow.zip

 

I will update the link from the previous post too. 

Link to comment

*Sigh*

 

My guess here is that the property names of the PDF's attributes are specific to one's locale, and so your metadata item that holds the document's title is likely not called Title, but possibly Titre.  To confirm my hunch, you could change this line:

 

        set PDFTitle to the PDF's documentAttributes()'s |Title| as text

 

to this:

 

        error PDF's documentAttributes() as record

 

which will show you what the attribute names and values are in the error dialog that pops up.  It's also possible that the specific metadata item for the document title wasn't filled in for the particular document you're testing the script on, as they are optional.

 

In both of these scenarios, the fix for it should be to change that line I just highlighted above to this:

 

        set PDFTitle to PDF's documentAttributes()'s ¬
                objectForKey:(PDFDocumentTitleAttribute of this)

 

Using the PDFDocumentTitleAttribute constant should return a value appropriate for your system; and using objectForKey to access the property as an NSDictionary rather than as a property that is accessed through a method will handle properties that don't exist a lot more gracefully by returning missing value.

 

I've already made the change and updated the script file on GitHub.

Edited by CJK
Link to comment

Change these lines:

 

        set PDFTitleLang to (NSLinguisticTagger's ¬
                dominantLanguageForString:PDFTitle)
        langs's setValue:1 forKey:PDFTitleLang

 

to this:

 

        tell PDFTitle to if missing value ≠ it then tell ¬
                (NSLinguisticTagger's dominantLanguageForString:it) to if ¬
                missing value ≠ it then langs's setValue:1 forKey:it

 

File updated on GitHub.

Edited by CJK
Link to comment

Would you mind doing me a favour and just sending me the PDFs that you are using for testing purposes?  That way, I can see what sorts of content you're dealing with.

 

This latest error is very similar in nature to the one affecting the Title of the PDF document, but it seems odd that it would arise when dealing with a page from the PDF.  It implies that there was no language detected, which in turn implies the page had no text.  I can perhaps believe this might be the case for one random PDF file, but if you're testing different files, as I presume you are, then the others ought to work.

Link to comment

@CJK Thanks a lot. I changed file, and it seems to be working now. I really admire your competence and your tenacity!

One last thing that is strange: when I feed the script Italian text, I may get .en with some files or, rightly so, .it with others. Apparently its detection of Italian is erratic... I guess that has nothing to do with the script. Again, thanks a lot and great job!!

Link to comment
1 hour ago, jmm28260 said:

I changed file, and it seems to be working now

 

That's good to hear.  However, it was still an error that occurred in one file, so does require fixing, as there may be other PDFs that have blank pages.  I've updated the file on GitHub with a fix for that, and a couple of other minor adjustments that I also realised were potential weak spots in those sorts of one-off situations.

 

2 hours ago, jmm28260 said:

One last thing that is strange: when I feed the script Italian text, I may get .en with some files or, rightly so, .it with others. Apparently its detection of Italian is erratic

 

No idea, I'm afraid.  That one will be down to the language tagging function, which isn't something I'm able to change.  Mojave does actually have a newer language detection class called NLLanguageRecognizer, although I'm not sure how it differs from the one my script uses (NSLinguisticTagger).  But I am running High Sierra, so I don't have access to the newer one in order to try it out.


Without knowing the content of the PDFs that you were expecting to be tagged as being Italian, I can't judge any probable things to investigate.  But, you could always try increasing the number of pages that are sampled in each PDF.  There's a line in the first section of the script where all the properties are declared:

 

        property samples : 4

 

If you increase this number, the script will use more pages to identify the language.
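
To see which pages a given setting picks, here is a small stand-alone sketch of the sampling arithmetic, based on the loop in the script: the step between sampled pages is (page count div samples) + 1, and the first and last pages are skipped when the PDF has more than two pages. The handler name sampledPageIndices is invented for this illustration and does not appear in the script.

```applescript
-- Illustrative only: list the zero-based page indices the loop would visit
on sampledPageIndices(pageCount, samples)
	if pageCount < 1 then return {}
	if pageCount ≤ 2 then
		-- Too short to skip anything: use every page
		set {firstIndex, lastIndex} to {0, pageCount - 1}
	else
		-- Skip the first and last pages
		set {firstIndex, lastIndex} to {1, pageCount - 2}
	end if
	
	-- Larger values of samples give a smaller step, hence more pages
	set stepSize to (pageCount div samples) + 1
	set pageIndices to {}
	repeat with i from firstIndex to lastIndex by stepSize
		set end of pageIndices to i
	end repeat
	return pageIndices
end sampledPageIndices

sampledPageIndices(40, 4) --> {1, 12, 23, 34}
sampledPageIndices(40, 8) --> {1, 7, 13, 19, 25, 31, 37}
```

So for a 40-page PDF, doubling samples from 4 to 8 takes the sample from 4 pages to 7.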


Just in case you're interested, the script currently decides on the language by detecting the language of each sampled page.  If different pages produce different results, then the script chooses the language that was tagged on the greatest number of pages in the sample.  So if your PDF had a mixture of Italian and English in it, and the sampled pages happened to feature more English than Italian, it would tag the file overall as English.
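
That tally can be sketched in plain AppleScript. The handler name and the sample list are made up for illustration; the script itself keeps its counts in an NSMutableDictionary (which also receives one vote from the document title) and sorts the keys by count. In this sketch a tie goes to the language seen first, which is an assumption rather than the script's documented behaviour.

```applescript
-- Illustrative only: pick the language code that occurs most often
on mostCommonLanguage(pageLanguages)
	set seenCodes to {}
	set tallies to {}
	repeat with entry in pageLanguages
		set langCode to contents of entry
		-- Find the slot for this code, if it has been seen before
		set slot to 0
		repeat with j from 1 to count of seenCodes
			if item j of seenCodes is langCode then set slot to j
		end repeat
		if slot is 0 then
			set end of seenCodes to langCode
			set end of tallies to 1
		else
			set item slot of tallies to (item slot of tallies) + 1
		end if
	end repeat
	
	-- Choose the code with the highest tally (first seen wins a tie)
	set {winner, best} to {missing value, 0}
	repeat with j from 1 to count of seenCodes
		if item j of tallies > best then
			set {winner, best} to {item j of seenCodes, item j of tallies}
		end if
	end repeat
	return winner
end mostCommonLanguage

mostCommonLanguage({"en", "it", "en", "en"}) --> "en"
```

With three English pages against one Italian page, the file would be tagged as English even though it contains Italian text.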


The other way I contemplated doing it was combining the sample pages together into one long piece of text, and detecting the language of that as if it were a single page.  That would always produce one definite result.  Then, of course, the last possibility is to remove the sampling altogether and just assess the entire PDF file in one go.  However, I avoided that because some of my PDFs are quite large, and if the script had to process many PDFs each with hundreds of pages, that's obviously going to slow it down compared to just sampling 4 pages from each PDF, without necessarily producing better results.


Anyway, if you have a particular preference over which method should be used, just let me know and I can change the implementation quite easily.

Link to comment

@CJK Thanks so much for this terrific job. I really appreciate your time and efforts.

After getting your last version, I ran it and I think something was deleted in your procedure, since it produced no file at all. I have also included it in my Automator workflow, and all the actions preceding the AppleScript got results, but nothing came out of the script. No results shown.

https://www.dropbox.com/sh/q8rp4cs8lmi7zb4/AADpmmhTSHL4wqNLldH6fuSla?dl=0

Would you mind checking it?

Link to comment
