Ruby, Folder Actions and full automation
I routinely scan my documents as PDFs so I can keep them in a virtual filing cabinet (you know, the whole “paperless office” thing). I use my HP all-in-one software running on a Windows VM inside a Mac (sorry, but the Mac scanning software on HP is complete garbage in my opinion).
What bothered me about this was that every scanned file ends up with a name like “scan12345.pdf”. Because of the way I file, I like having my things as “year/company/year-month-date.pdf” instead.
For a while there I was spending hours manually moving files to the correct folders while watching TV or doing another “using half-my-brain” activity. I thought there had to be a better way. I’m a bit handy with software, so I started a little Ruby script to do the move for me.
The script uses pdftotext (you can get pdftotext using MacPorts) and Ruby to read the text inside the actual PDF (scan your documents as “searchable PDFs” so the scanning software OCRs the text). A series of regular expressions inside the script determines which company the document belongs to. A robust date parser then looks at anything in the text that resembles a date, so the file gets a date from inside the scanned document rather than the date of the file (which the script falls back to if it can’t find any parsable dates). Once it has made that determination, it moves the file to the appropriate place.
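Here is a minimal sketch of that flow, not the actual script. The company names, the patterns, the three date formats, the “unsorted” fallback folder and every method name are made up for illustration. It assumes pdftotext is on your PATH and relies on the fact that passing “-” as the output file makes pdftotext print the text to standard output.

```ruby
require "date"
require "fileutils"
require "open3"

# Hypothetical company patterns; the real script's patterns are not shown here.
COMPANY_PATTERNS = {
  "acme-power"   => /acme\s+power/i,
  "example-bank" => /example\s+bank/i
}.freeze

# Assumed date shapes, tried in this order: [regex, strptime format].
DATE_PATTERNS = [
  [/\b\d{4}-\d{2}-\d{2}\b/, "%Y-%m-%d"],
  [/\b\d{1,2}\/\d{1,2}\/\d{4}\b/, "%m/%d/%Y"],
  [/\b[A-Z][a-z]+ \d{1,2}, \d{4}\b/, "%B %d, %Y"]
].freeze

# Returns the OCR text of the PDF, or "" if pdftotext fails or is missing.
def extract_text(pdf_path)
  out, status = Open3.capture2("pdftotext", pdf_path, "-")
  status.success? ? out : ""
rescue Errno::ENOENT
  ""
end

# Returns the name of the first company whose pattern matches the text.
def detect_company(text)
  match = COMPANY_PATTERNS.find { |_name, pattern| text =~ pattern }
  match ? match.first : "unsorted"
end

# Returns the first candidate that parses as a real date, preferring earlier
# entries in DATE_PATTERNS; falls back to the given date if nothing parses.
def detect_date(text, fallback)
  DATE_PATTERNS.each do |regex, format|
    text.scan(regex) do |candidate|
      begin
        return Date.strptime(candidate, format)
      rescue ArgumentError
        next # looked like a date but was not one, e.g. 13/45/2010
      end
    end
  end
  fallback
end

# Builds root/year/company/year-month-date.pdf
def destination_path(root, company, date)
  File.join(root, date.year.to_s, company, date.strftime("%Y-%m-%d.pdf"))
end

# Moves one scanned PDF into the filing tree and returns its new path.
def file_pdf(pdf_path, root)
  text = extract_text(pdf_path)
  date = detect_date(text, File.mtime(pdf_path).to_date)
  target = destination_path(root, detect_company(text), date)
  FileUtils.mkdir_p(File.dirname(target))
  FileUtils.mv(pdf_path, target)
  target
end
```

Calling `file_pdf("scan12345.pdf", "/path/to/filing")` on a statement from the hypothetical Acme Power dated 03/15/2010 would move it to `/path/to/filing/2010/acme-power/2010-03-15.pdf`. Note that this sketch would overwrite an earlier file from the same company with the same date, so a real version would want to handle that collision.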
So now I had a script that could do those moves properly, but I still didn’t want to have to remember to keep running it. Here is where the Mac’s Folder Actions feature comes in.
Using Folder Actions, I wrote the following script to wait for the file to drop in the folder (the scanning program on the Windows side uses a VMware shared folder to drop the PDF into a Mac folder):
It’s not perfect. It uses a delay mechanism to wait for the Windows side to finish writing the PDF (the dumb scanning program creates a zero-byte file and doesn’t fill it until it has run OCR on the whole thing). But because the folder action works on all PDFs in the folder, it can pick up the ones it couldn’t run pdftotext on the last time, so it’s good enough for now. It also has a real problem pulling the correct date from documents where a lot of patterns could be a date (I need to work on a good algorithm, and I haven’t found one yet).
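One possible alternative to a fixed delay, which is not what my folder action does, is to poll the file on the Ruby side until it is non-empty and its size has stopped changing between two checks. The sketch below is only that, a sketch: the method name, the two-second interval and the check limit are arbitrary choices, and a stable size is a heuristic rather than proof that the writer is done.

```ruby
# Returns true once the file is non-empty and its size is unchanged between
# two consecutive checks; returns false if that never happens in max_checks.
def wait_until_written(path, interval: 2, max_checks: 30)
  last_size = -1
  max_checks.times do
    size = File.size?(path) # nil when the file is missing or zero bytes
    return true if size && size == last_size
    last_size = size || -1
    sleep interval
  end
  false
end
```

A caller could skip any PDF for which `wait_until_written` returns false and leave it for the next run, which fits the way the folder action already retries files it could not process earlier.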
Hopefully it will be useful for you if you have a similar need or just for learning how you can streamline some of the stuff you do every day with your Mac. Cheers!