Tesseract 4 has a new OCR engine based on long short-term memory (LSTM)
neural networks, which considerably improves accuracy and thus helps our
VM tests.
I ran the new version across a number of different screenshots and
compared the results to the 3.x branch. It makes a big difference,
especially with various font rendering settings.
The only downside is that version 4 hasn't been released yet and is
currently in alpha state, but it will eventually get there. The only
solutions I could come up with that stick to version 3 were really
sub-par:
* Use several passes with different color negation on the screenshots.
* Train Tesseract 3 specifically for screenshots. This is sub-par
because we'd need to do it for Tesseract 4 from scratch again.
* Change the test systems so that they use *only* an OCR font when
displaying text. I've actually tried this, but it also isn't accurate
enough with our default font rendering setup.
* Turn off special font rendering settings for our tests. In
conjunction with changing to an OCR font this might work but it won't
catch all the cases, because applications might use their own font
rendering.
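
For reference, the multi-pass idea from the first bullet could be
sketched roughly as follows. This is only an illustration: the `ocr`
callable is a hypothetical stand-in for whatever invokes Tesseract, the
image is assumed to be a grid of 8-bit grayscale values, and keeping the
longest result is an assumed heuristic, not something the tests do.

```python
from typing import Callable, List

Image = List[List[int]]  # rows of 8-bit grayscale pixels (assumption)


def negate(image: Image) -> Image:
    """Return a color-negated copy of an 8-bit grayscale image."""
    return [[255 - px for px in row] for row in image]


def ocr_multi_pass(image: Image, ocr: Callable[[Image], str]) -> str:
    """Run the hypothetical `ocr` callable on the original and on the
    negated image, keeping the pass with the most non-whitespace text."""
    results = [ocr(variant) for variant in (image, negate(image))]
    return max(results, key=lambda text: len("".join(text.split())))
```

Every additional variant multiplies the OCR time per screenshot, which
is part of why this approach is sub-par.
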
Given that version 4 is also faster[1] when it comes to OCR detection,
and given the points just mentioned, I think using the alpha version
just for tests isn't going to hurt anybody.
[1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
Signed-off-by: aszlig <aszlig@redmoonstudios.org>
The changes are a bit too extensive to include here in the commit
message, so if you want the details of what changed, please visit this
URL:
http://leptonica.org/source/version-notes.html
I have also provided openjpeg, giflib and libwebp as dependencies so
that Leptonica is able to read/write those file formats.
Additionally, I've added a patch that uses pkgconfig to resolve all
dependencies (except giflib), because unlike AC_CHECK_LIB(), the
PKG_CHECK_MODULES() macro defines *_LIBS variables that include the
linker search path.
Unfortunately, that patch alone is not enough, because the upstream
configure.ac substitutes the *_LIBS variables so that they do *not*
include the linker search paths. We therefore need to remove the
AC_SUBST() calls within PKG_CHECK_MODULES().
The only dependency that's not yet using PKG_CHECK_MODULES() is giflib,
because giflib doesn't have a pkg-config description file. We therefore
use substituteInPlace to insert the linker search path after the lept.pc
file has been generated by configure.
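
To illustrate the giflib workaround: the effect of that
substituteInPlace call is a plain text replacement in the generated
lept.pc. A rough Python equivalent might look like the sketch below. It
assumes the file refers to giflib as `-lgif` and uses a made-up store
path; the actual replacement pattern used in the package may differ.

```python
def add_search_path(pc_text: str, lib_flag: str, lib_dir: str) -> str:
    """Prefix every occurrence of `lib_flag` with a -L search path,
    similar in spirit to a substituteInPlace --replace call."""
    return pc_text.replace(lib_flag, f"-L{lib_dir} {lib_flag}")


# Hypothetical excerpt of a generated lept.pc (contents are assumed).
example_pc = (
    "Name: leptonica\n"
    "Libs: -L${libdir} -llept\n"
    "Libs.private: -lz -lgif -lm\n"
)

# The store path below is a placeholder, not a real giflib output path.
patched = add_search_path(example_pc, "-lgif", "/nix/store/<hash>-giflib/lib")
print(patched)
```
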
Another thing that we no longer need is the dependency on libpng version
1.2, because Leptonica now also works with more recent libpng versions.
Tested by building the package itself and also the following packages
that directly depend on leptonica:
* k2pdfopt
* tesseract
* jbig2enc
All of these packages built successfully on x86_64-linux.
The main reason why I'm bumping Leptonica to version 1.74.1 is that we
need at least version 1.74 to bump Tesseract to the latest upstream
version.
Signed-off-by: aszlig <aszlig@redmoonstudios.org>
There are a few dozen new failures on Darwin, probably related to
updates of stdenv's llvm and/or pkgconfig.
Still, the total number of successes increases.
Including apple_sdk.sdk is generally a recipe for a bad time on LLVM 3.8
and above, since you end up with bad headers in the wrong place that
hurt the new libc++. In this case, qt only wanted the super-generic SDK
for its CUPS headers, which we can now just depend on directly.