2024/05/13

From HypertWiki
Monday, May 13, 2024 (#134)
Woozle's Journal

Harena needed to snag the text from some images, and I thought: Shirley, by now there must be a usable GUI OCR package in Ubuntu.

So I searched apt for "OCR", and... after wading through a bunch of stuff, found a thing called YAGF, which is apparently a front-end for either of two CLI apps, Cuneiform and Tesseract (the latter being the default, and something I had apparently installed earlier).

After loading up a sample image to scan, the following sequence of following events followed, following my loading up of a sample image to scan:

  • It claimed I hadn't installed the English language data files for Tesseract, which I had.
  • I tried re-installing them, then closing and reopening YAGF; no joy.
  • After much searching the web for help, I noticed that the Settings dialog box asks for the location of the Tesseract data files, which had been set to the root folder.
    • It seems somewhat unhelpful to complain about not being able to find files when the location for those files obviously hasn't been set.
    • I looked in apt to see where the files might be, and found them in /usr/share/tesseract-ocr/5/tessdata.
  • After this, YAGF would do what appeared to be an attempt to parse the image, but it lasted less than a second and produced no output.
    • I wasn't sure if YAGF would definitely show me the output, so I did a "File → Save All Text". The resulting file had zero bytes.
    • I tried modifying the path, in case YAGF was expecting an enclosing or enclosed folder (but wasn't upset to the point of giving me the error again) -- no go.
  • Then I thought of switching to Cuneiform, in case that worked better -- which (after installing it) it does.
    • Running the scan with Cuneiform without installing Cuneiform doesn't give an error message. This is also unhelpful.
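
For next time, here is a rough Python sketch of the checks I ended up doing by hand: find where the Tesseract language files actually live, check whether the engines are installed at all, and run Tesseract directly so that any failure is visible instead of producing a zero-byte file. The function names are made up for illustration, and the command names "tesseract" and "cuneiform" and the /usr/share search root are assumptions based on my Ubuntu setup. I have not checked whether running Tesseract this way gets around whatever YAGF was tripping over.

```python
import os
import shutil
import subprocess


def find_tessdata_dirs(root="/usr/share", lang="eng"):
    """Return sorted directories under root that contain <lang>.traineddata.

    The default root is an assumption; on my machine the files turned up in
    /usr/share/tesseract-ocr/5/tessdata.
    """
    wanted = lang + ".traineddata"
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if wanted in filenames:
            hits.append(dirpath)
    return sorted(hits)


def missing_engines(names=("tesseract", "cuneiform")):
    """Return the engine commands that are not on PATH.

    The command names are assumed to match the package names.
    """
    return [name for name in names if shutil.which(name) is None]


def ocr_with_tesseract(image_path, tessdata_dir, lang="eng"):
    """Run the tesseract CLI directly and return the recognized text.

    check=True raises on a non-zero exit, so a bad data path or a missing
    language file shows up as an error rather than as empty output.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout",
         "--tessdata-dir", tessdata_dir, "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like find_tessdata_dirs() would give the path to paste into YAGF's Settings dialog, and missing_engines() gives the error message that YAGF doesn't when Cuneiform isn't there.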