Funding from a US university has been secured to bring the TILT tool to a finished state. Over the next few months I will be completing the editing GUI and adding a viewing tool for pages recognised using TILT. This will allow texts to be scrolled in sync with the source images, linked at the word level, which should make for very smooth scrolling. I will also be writing up a publication for an HCI-type journal to explain the design in preparation for user-testing.
Thursday, May 7, 2015
TILT has now been thoroughly revised at every stage except the final linking. The result is better separation of text from the image, and better line- and word-recognition. It is almost at the point where, for many manuscripts, the automatic output will be good enough for a first-pass set of text-to-image links. Here are the automatic results for the difficult De Roberto V2 manuscript of The Viceroys. Note the good automatic word-identification. This will be refined further in the linking step: by reference to the real text these shapes represent, it will be possible to split some of the merged words, though not those where there is overwriting, of course. The main difference from the previous version is that words are now restricted to their own line-region, which is polygonal rather than rectangular. The rectangular strategy was too simple, and led to recognised words that crossed line-boundaries. Joined ascenders and descenders are all too common in manuscripts.
Monday, April 13, 2015
The first step in trying to recognise words on a page is to obtain a good black and white representation of the colour original. Typically, when you apply standard binarisation techniques (like Sauvola's) to manuscript images, thin pen-strokes disappear and the writing breaks up. This makes recognition of lines and words very difficult. Here's part of the Brewster journal (Biodiversity Heritage Library) rendered using the Ocropus toolset:
As you can see, the thin pen-strokes have broken up and the text is difficult to read. But then I realised that the information about these thin strokes was still present in the original greyscale image. By comparing the broken characters with the greyscale I should be able to extend them, so long as the missing pixels were darker than the local average. For this to work I had to create several copies of the image:
1. A Gaussian-blurred version of the original greyscale, with a blur radius of 1/80th of the image height.
2. The greyscale image itself.
3. The regular binarised image.
4. A mask, generated by blurring the binarised image and rendering all the blurred pixels as pure black.
I then examined each pixel in 3: if it was black, I recomputed the "blob" of connected pixels directly from the greyscale image at the same coordinates. To decide whether each pixel should be black or white I used the value at the same coordinates in 1, which is effectively the local average pixel value. Since both text and background are included in the calculation, the average at each point is likely to be somewhat darker than the background alone, so any pixels at least that dark, and in the vicinity of the originally recognised text, are very likely to be the missing fragments of letters. Judge for yourself:
To stop this extension bleeding into the surrounding parts of the greyscale image I used the mask (4) to restrict how far the local extensions could go. The blur radius I used here was 1/200th of the image height. I've tried the method on printed books, typescripts and manuscripts, and it seems to work quite well. The beauty of it is its simplicity: no machine learning, no fancy transformations. The small lines under the words could be removed by a blue filter on the original colour image, but since I'm only interested in word-recognition, not recognition of individual letters, these lines mostly do no damage.
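The whole procedure can be sketched in a few lines of numpy. This is my own reconstruction, not the TILT code: it uses a box blur as a stand-in for the Gaussian, assumes the binarised image is supplied by whatever binariser you prefer, and the names (`box_mean`, `recover_strokes`, `blur_div`, `mask_div`) are my own.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_mean(img, radius):
    """Local mean over a square window -- a crude stand-in for a Gaussian blur."""
    w = 2 * radius + 1
    padded = np.pad(img.astype(float), radius, mode='reflect')
    return sliding_window_view(padded, (w, w)).mean(axis=(-2, -1))

def recover_strokes(grey, binarised, blur_div=80, mask_div=200):
    """Re-attach faint stroke pixels to a broken binarisation.

    grey: 2-D greyscale array (0 = black); binarised: boolean ink mask (True = ink).
    """
    h = grey.shape[0]
    # 1. blurred greyscale = the local average pixel value at each point
    local_avg = box_mean(grey, max(1, h // blur_div))
    # 4. the mask: blur the binarised image and keep every non-white pixel,
    #    so extensions stay in the vicinity of already-recognised text
    mask = box_mean(binarised.astype(float), max(1, h // mask_div)) > 0
    # any masked pixel at least as dark as the local average is
    # probably a missing fragment of a letter
    extended = (grey < local_avg) & mask
    return binarised | extended
```

On a synthetic image with two dark blobs joined by a faint stroke, the stroke pixels next to the blobs come back black while the clean background stays white.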
Wednesday, February 25, 2015
TILT finally has an editing GUI of sorts. Although it doesn't do much yet, over the next few days it should acquire all the new parts that have been created in the past two weeks, and incorporate the TILT back-end service to recognise pages dynamically. It will also allow the user to edit and save alignments, which have mostly been produced automatically. So far all it can do is switch between justified view of the text and line-by-line mode, and zoom the image. But as can be seen from the buttons on the left, there is plenty to come.
Friday, February 20, 2015
One of the problems I described in the last entry was slicing a polygon in two. Doing this required a pretty good understanding of high-school geometry. But it now works in the demo on this page.
But having more than one polygon creates a new problem. As you move the mouse over the page you have to decide which polygon you are moving over, or which corner-point you are clicking on. There might be hundreds of polygons or points on a page, and you have only a few milliseconds to decide between them. Imagine that you have some way to test "is the mouse inside this polygon?" or "is the mouse over a corner point of this polygon?" If you have, say, 100 polygons, you will have to call those two tests 100 times whenever the mouse moves even a small distance. And that will be much too slow if you want the interface to be responsive.
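The "is the mouse inside this polygon?" test is the classic ray-casting check, and it is cheap on its own; what gets expensive is calling it for every polygon on every mouse move. A minimal sketch of the test itself (my own, language-agnostic, not the TILT code):

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from (x, y) crosses; an odd count means the point is inside.
    poly is a list of (x, y) corner points."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # does this edge straddle the ray's y level?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

With 100 polygons of, say, 20 points each, every mouse move costs thousands of edge tests, which is why some spatial index is needed.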
So I decided to divide the page-image into four rectangles (NW, NE, SW, SE), each containing either at most four points or nothing. If a rectangle contains no points it acts instead as a container for four smaller rectangles nested inside it. And since the mouse can only be in one place at a time, it can only ever be over one rectangle that has any points. As well as the points, each such rectangle also contains a list of the polygons that overlap some part of it. Deciding which rectangle you are in is easy because they are nested. So now it is a simple matter to test "is the mouse currently over a polygon or a corner-point?", because there will only be a few of them in each rectangle. The only problem is that, as you edit the polygons and points, the rectangles must be kept up to date. But that is a solvable problem. Click on the image below to see it in action:
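This nested-rectangle scheme is essentially a quadtree. A toy sketch of the point-storing part, to show the shape of the idea (class and method names, and the capacity of four, are my own choices; the real editor also attaches overlapping-polygon lists to each rectangle):

```python
class QuadNode:
    """A rectangle holding at most CAPACITY points, or four child quadrants."""
    CAPACITY = 4

    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h
        self.points = []
        self.children = None   # None while this node is a leaf

    def contains(self, px, py):
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

    def insert(self, px, py):
        if not self.contains(px, py):
            return False
        if self.children is None:
            if len(self.points) < self.CAPACITY:
                self.points.append((px, py))
                return True
            self._subdivide()   # full: push points down into quadrants
        return any(c.insert(px, py) for c in self.children)

    def _subdivide(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [QuadNode(self.x, self.y, hw, hh),            # NW
                         QuadNode(self.x + hw, self.y, hw, hh),       # NE
                         QuadNode(self.x, self.y + hh, hw, hh),       # SW
                         QuadNode(self.x + hw, self.y + hh, hw, hh)]  # SE
        for p in self.points:
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, px, py):
        """Return the few candidate points in the leaf rectangle under the mouse."""
        if not self.contains(px, py):
            return []
        if self.children is None:
            return self.points
        for c in self.children:
            if c.contains(px, py):
                return c.query(px, py)
        return []
```

A mouse-move handler then only tests the handful of points (and polygons) returned by `query`, instead of every one on the page.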
So now what I have is point-delete, point-add, point-drag, polygon-anchor (freeze and highlight points), polygon-highlight and polygon-split. That's enough to try to create a usable GUI for editing the output of the TILT recognition process. And that, of course, is the next step.
Sunday, February 15, 2015
I have refined the test program in the previous post to add points. So now you can add and delete as well as move points. Some would argue that this is enough. But there are some tools that would greatly speed up editing, which no one else seems to have thought of yet:
1. Slicing a polygon in two. Imagine that you have a polygon that covers several words and you need to cut it quickly into two. With just the ability to delete and add points on existing polygons (no new polygons), how else could you do that? With the mouse, all you would need to do is drag a line over an anchored polygon, then release the mouse to slice it along that line.
2. Merging two or more polygons. If you have fragmented polygons it would be great to just shift-click them and merge them in a single stroke of the mouse. This could be done by dragging from inside one polygon, across the ones to be merged, and ending inside another: all the dragged-over polygons would then be merged.
3. Creating blobs. By clicking on a region that has no polygon you could send a message to the service to try to recognise a word there in one go.
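The slicing tool reduces to cutting a polygon along the infinite line through the drag's start and end points. For a convex polygon this is straightforward: classify each vertex by which side of the line it falls on, and insert the two crossing points into both halves. (Concave shapes need more care.) A sketch of the convex case, my own rather than the TILT implementation:

```python
def split_convex_polygon(poly, a, b):
    """Split a convex polygon along the infinite line through points a and b.
    Returns (left, right) vertex lists; one may be empty if the line misses."""
    ax, ay = a
    bx, by = b

    def side(p):
        # sign of the cross product: which side of the line a->b point p lies on
        return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

    left, right = [], []
    n = len(poly)
    for i in range(n):
        p, q = poly[i], poly[(i + 1) % n]
        sp, sq = side(p), side(q)
        if sp >= 0:            # vertex on the line goes to both halves
            left.append(p)
        if sp <= 0:
            right.append(p)
        if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
            # edge crosses the cut line: interpolate the crossing point
            t = sp / (sp - sq)
            ix = p[0] + t * (q[0] - p[0])
            iy = p[1] + t * (q[1] - p[1])
            left.append((ix, iy))
            right.append((ix, iy))
    return left, right
```

Cutting a 10x10 square down the middle with a vertical drag yields two 5x10 rectangles sharing the two crossing points.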
I've nearly got 1 to work. 1 and 2 are a bit counter-intuitive, because dragging in drawing programs is supposed to draw a square marquee. But marquees are just not very useful in this case, so I think overriding the default is a good idea. We need to facilitate the operations that the editor of a set of polygons will use all the time, or it will quickly become tedious.