| Author |
Message |
the saint

Joined: 09 Dec 2003 Location: not there yet...
|
Posted: Tue Mar 08, 2005 12:13 am Post subject: PDF OCR if you don't understand these terms don't bother |
|
|
Ages ago, when I finished my MA, I scanned all my materials so that I could transport them without lugging six huge folders around. I have them as black-and-white .gif files at 100dpi, which makes them easily legible. However, when I try to run OCR on them in Acrobat, it fails because the minimum resolution for OCR is apparently something like 144dpi. This makes it pretty inconvenient to search the reams of paper for particular terms when I'm writing up research and need to do some reading.
Has anyone any idea if it is possible to increase the dpi of an image once it is scanned? Alternatively, does anyone have any idea how else I might be able to get this text scanned at this resolution?
Really wish I hadn't skimped on it now...
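For reference, the arithmetic behind the question can be sketched in a few lines of Python. This is only an illustration: the A4 page size (roughly 827 x 1169 pixels at 100dpi) is an assumption, not something stated in the post, and the function name is made up. It shows how many pixels the same page would need at 144dpi if the printed size stays the same.

```python
def resampled_size(width_px, height_px, current_dpi, target_dpi):
    """Pixel dimensions needed at target_dpi for the same printed size.

    Integer ceiling division is used so the result never falls
    just short of the target resolution.
    """
    new_width = (width_px * target_dpi + current_dpi - 1) // current_dpi
    new_height = (height_px * target_dpi + current_dpi - 1) // current_dpi
    return new_width, new_height


# Assumed example: an A4 page scanned at 100dpi is about 827 x 1169 pixels.
print(resampled_size(827, 1169, 100, 144))  # (1191, 1684)
```

The extra pixels have to be invented by interpolation, so the enlarged image carries no more real detail than the original scan.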
Last edited by the saint on Tue Mar 08, 2005 12:50 pm; edited 1 time in total |
|
|
 |
Bunnymonster

Joined: 16 Mar 2004 Location: Tokyo
|
Posted: Tue Mar 08, 2005 3:32 am Post subject: |
|
|
Which software are you using to try and OCR the image with? The OCR feature in Acrobat itself is horrid, to say the least. I'd 'recommend' Panagon Capture, mainly as it enhances the documents and lets you add index fields automatically before releasing them into FileNet, which is an awesome document management system. However, it is a professional bit of kit and might well be very expensive to buy.
You can increase the resolution of an image easily enough (Photoshop can do it from resize image). However, if there isn't enough information for the program to work with, you will have to wade through the result checking for errors.
May I suggest, depending on what you actually want it for, that converting the .gifs to PDF form and then indexing the PDFs might be an easier and more practical solution than trying to painstakingly OCR the whole documents? If you do choose to go the OCR route, it is probably quicker and easier to get a professional to do it, as they will have the heavy-duty scanners and will have their software setup down to a fine art.
This field was my job for 2 years so if you need more specific information regarding solutions to your problem let me know. |
|
|
 |
Bulsajo

Joined: 16 Jan 2003
|
Posted: Tue Mar 08, 2005 5:10 am Post subject: |
|
|
| I hate OCR programs- they never seem to work without requiring a lot of reformatting and re-typing. I wonder if Panagon has a trial version... |
|
|
 |
Bunnymonster

Joined: 16 Mar 2004 Location: Tokyo
|
Posted: Tue Mar 08, 2005 6:37 am Post subject: |
|
|
Panagon Capture is part of FileNet, if you want to look for it. It's a professional piece of software and really not designed for home use. It's pretty powerful stuff, but even then the OCR is not even close to perfect. We reckoned that getting anything above 70% of characters correct was absolutely amazing, though this was on handwritten documents (printed text was better). We never wanted the formatting retained, though, and this is somewhere OCR software really struggles. What makes Panagon Capture so powerful (beyond the way it integrates with the rest of FileNet) is the control it gives you over what to do with errors, along with the ability to customise the cleanup process in limitless ways (you could write custom scripts for it).
The more I think about this, the more I think that creating a set of .pdf files and a proper index, which can then be managed as a database, would be the best solution to what I think the OP wants. That said, I might be way off. Creating this index will let you search for key terms and then pull up the documents which match those criteria. This is both more and less powerful than an OCR word search. More, because an OCR word search will give you many irrelevant results and will omit relevant pages where the search word does not actually feature in the text. Less, because the index will only take you to the relevant page; you will not be able to find exactly where the word occurs.
You might want to run an OCR pass over the documents and assign the recognised words to an index field (which is what FileNet does by default), which will let you search for words. It would not be my primary method of searching, for the reasons outlined above, and you definitely want to set up the manager so that it brings up the original scans of the pages, not the 'OCRified' documents. |
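The index idea described in this post can be sketched roughly as follows. This is a minimal illustration in Python, not how FileNet or Panagon Capture actually works; the file names and keywords are hypothetical. Each scanned page gets a handful of hand-assigned key terms, and a lookup returns the pages tagged with a term.

```python
from collections import defaultdict


def build_index(page_keywords):
    """Map each key term (lower-cased) to the set of pages tagged with it."""
    index = defaultdict(set)
    for page, keywords in page_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(page)
    return index


def lookup(index, term):
    """Return the pages tagged with the term, in a stable sorted order."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical scans with hand-assigned key terms.
pages = {
    "folder1_page001.gif": ["methodology", "interviews"],
    "folder1_page002.gif": ["interviews", "ethics"],
    "folder2_page001.gif": ["methodology"],
}
index = build_index(pages)
print(lookup(index, "Interviews"))
```

As the post says, a lookup like this only leads to the relevant page, not to the position of the word on it, and a page can be tagged with a term that never appears in its text.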
|
|
 |
the saint

Joined: 09 Dec 2003 Location: not there yet...
|
Posted: Tue Mar 08, 2005 12:48 pm Post subject: |
|
|
Thanks Bunny...
What I have been doing is precisely what you thought: using Acrobat to OCR the gifs. Unfortunately, it won't do it if the res is below 144dpi, which many of my gifs are. Yeah, I could just manually go through and add bookmarks to search by, but this would take me a long time as I have hundreds of pages. I've found the OCR in Acrobat 7 to be almost perfect. Let me clarify that by saying that I'm not asking it to convert image to text. Rather, Acrobat allows you to leave the original image in place while a text version resides behind the image in a hidden layer. This means you can effectively search an image for text terms. Sifting out irrelevant results is quite easy because the search dialogue in Acrobat 7.0 shows the search terms in context. It is a simple matter to find the ones you are after.
What I am interested in, Bunny, is how I can increase the dpi using Photoshop or even Paintshop Pro, which I'm far more familiar with. Is this possible? If so, how? So far I've been unable to find out how to do it.
Cheers |
|
|
 |
Bunnymonster

Joined: 16 Mar 2004 Location: Tokyo
|
Posted: Tue Mar 08, 2005 4:24 pm Post subject: |
|
|
In Photoshop go to Image -> Image Size -> enter the new resolution in the box.
In Paintshop Pro it is also in the resize image box. I have no idea which menu it's in, but it is there somewhere...
Hope that helps. |
|
|
 |
SuperHero

Joined: 10 Dec 2003 Location: Superhero Hideout
|
Posted: Tue Mar 08, 2005 11:00 pm Post subject: |
|
|
| Bulsajo wrote: |
| I hate OCR programs- they never seem to work without requiring a lot of reformatting and re-typing. I wonder if Panagon has a trial version... |
ABBYY FineReader makes very few errors in my experience. |
|
|
 |
the saint

Joined: 09 Dec 2003 Location: not there yet...
|
Posted: Wed Mar 09, 2005 12:17 pm Post subject: |
|
|
| Bunnymonster wrote: |
In Photoshop go to Image -> Image Size -> enter the new resolution in the box.
In Paintshop Pro it is also in the resize image box. I have no idea which menu it's in, but it is there somewhere...
Hope that helps. |
Tried that in Paintshop Pro, and you were right, it is in the image resize box. Exported the gif and overwrote the original file with a 200dpi one. Simply didn't work. When I attempted OCR, I got the same message that the res is below 144dpi.
Hmmmm  |
|
|
 |
Demophobe

Joined: 17 May 2004
|
Posted: Wed Mar 09, 2005 3:03 pm Post subject: |
|
|
Yes. Resizing the .gif won't do anything for the dpi. In order to change this, the image needs to be re-sampled, or in a sense, re-scanned.
The following is from here.
PaintShop Pro 7.0
Click "File", "Open" and open the image
Click "Image", "Resize"
Check "Actual/print size" and enter "300" in "Resolution" and set your print dimensions
Set "Resize type" to "Bicubic resample"
Check "Resize all layers" and "Maintain aspect ratio...", then click "OK"
You are now ready to apply text and photos..?
When resizing print size after resampling an image to a higher resolution, resize only the height or the width with "Maintain aspect ratio...." checked so that it will be resized to the scale of the original image. Otherwise the "Resize" will result in a certain amount of image distortion rather than maintaining the scale of the original image.
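To make the resampling and aspect-ratio point concrete, here is a small dependency-free Python sketch. It is not what Paint Shop Pro does internally: it uses nearest-neighbour resampling rather than the bicubic method named in the steps, purely to keep the example short, and the function name is made up. Only the new width is given; the height is derived from it so the proportions of the original are kept.

```python
def resample_nearest(bitmap, new_width):
    """Enlarge or shrink a bitmap (a list of pixel rows) to new_width.

    The new height is derived from the width so the aspect ratio is kept.
    Each output pixel copies the nearest source pixel (nearest-neighbour).
    """
    old_height = len(bitmap)
    old_width = len(bitmap[0])
    new_height = max(1, round(old_height * new_width / old_width))
    result = []
    for y in range(new_height):
        src_y = min(old_height - 1, y * old_height // new_height)
        row = [
            bitmap[src_y][min(old_width - 1, x * old_width // new_width)]
            for x in range(new_width)
        ]
        result.append(row)
    return result


# A tiny 2 x 2 black-and-white bitmap, doubled in each direction.
tiny = [[0, 1],
        [1, 0]]
bigger = resample_nearest(tiny, 4)
for row in bigger:
    print(row)
```

The enlarged bitmap has more pixels but no new detail, which is why sharpening or smoothing is often applied afterwards.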
Another technique for clarifying detail (if you are using Paint Shop Pro) is to use "Unsharp Mask" (radius "3", strength "65", clipping "16" - adjust based on your preferences). If you are experiencing a color wash when printing, you can increase color saturation by using "Hue Map" and setting "Saturation Shift" to "19" (or lower depending on your preferences).
You may need to resize the image to fit the new layout. The goal in resizing images is to retain as much image clarity as possible. If you lose image clarity after a resize, undo the resize and do the following:
Apply Effects >> Noise >> Edge Preserving Smooth (set this to "1" for a 100 dpi image, "7" for a 300 dpi image, etc.). Now resize your image to the size needed for your project. If you find that you still need to correct the image after resizing, apply Effects >> Sharpen >> Unsharp Mask and modify the settings until you are satisfied with the clarity of the image. |
|
|
 |
the saint

Joined: 09 Dec 2003 Location: not there yet...
|
Posted: Wed Mar 09, 2005 11:02 pm Post subject: |
|
|
Demo, thanks, as always, for your detailed help.
I have tried everything to resample these images (which are actually at 75dpi), and every time I save them and open them in Acrobat to convert to pdf and scan, I get told that the dpi is still below 144.
Not sure why this is, and I'm beginning to think that the whole thing is probably not worth the effort if such detailed editing needs to be applied to each of these hundreds of images.
So, unless anyone has any last ditch solutions to why Acrobat still doesn't recognise the files as being at 144dpi or more, I'll give in... |
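One possible explanation, which nobody in the thread confirms: the GIF header records the image size in pixels but has no field for dpi, so any dpi figure has to be assumed or inferred by the program that opens the file. How Acrobat chooses that figure is not known here. The Python sketch below reads the fixed header fields of a GIF and works out the dpi implied by an assumed page width. The sample header bytes and the A4 width of 8.27 inches are hypothetical.

```python
import struct


def read_gif_header(data):
    """Parse the signature and Logical Screen Descriptor of a GIF file.

    The descriptor holds width, height, a packed flags byte, a background
    colour index and a pixel aspect ratio byte. There is no dpi field.
    """
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    width, height, _packed, _background, aspect = struct.unpack(
        "<HHBBB", data[6:13]
    )
    return {
        "version": data[3:6].decode("ascii"),
        "width_px": width,
        "height_px": height,
        "pixel_aspect_byte": aspect,
    }


def implied_dpi(width_px, page_width_inches):
    """Resolution implied by a pixel width and an assumed physical width."""
    return width_px / page_width_inches


# Hypothetical header for a 620 x 877 pixel scan (about A4 at 75dpi).
sample = b"GIF89a" + struct.pack("<HHBBB", 620, 877, 0x80, 0, 0)
header = read_gif_header(sample)
print(header["width_px"], header["height_px"])
print(round(implied_dpi(header["width_px"], 8.27), 1))
```

If this is what is happening, saving the resampled pages in a format that does store resolution, such as TIFF, might be worth trying, though that is a guess rather than a tested fix.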
|
|
 |
|