Sunday, November 9, 2014

First steps with user interface for TILT

One of the key requests that emerged from the BL Labs seminar last Monday (3/11/14) was the need for a usable GUI to control TILT. I wanted something efficient and lightweight that would last more than a few months, so I set about building the GUI using only standard Web technologies – just jQuery and JavaScript – no 'plugins'. My experience with the latter is that, once their creators have moved on, they don't tend to be updated, and they are quickly replaced by the next latest fad. Also, learning how to use them – and, more often than not, discovering half-way through that they are missing some key feature – often makes more work than just writing your own solution from scratch. So I thought I would solve the most difficult problem first: how to represent and edit polygons on screen.

What it currently lacks is the ability to add and remove points. Once that is added, I need to make it load all the polygons for an entire page, as generated by TILT. But one step at a time.
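
To give an idea of the sort of thing involved, here is a minimal sketch of an editable polygon built with only SVG, plain JavaScript and a little jQuery. It is an illustration, not the actual TILT GUI code: the element id ('overlay') and the starting points are invented for the example.

    // Sketch only: render one editable polygon over a page image.
    // Assumes an <svg id="overlay"> element positioned over the image
    // and that jQuery is loaded; ids and coordinates are illustrative.
    var ns = 'http://www.w3.org/2000/svg';
    var svg = document.getElementById('overlay');
    var points = [[120, 80], [260, 70], [300, 160], [140, 180]];

    var poly = document.createElementNS(ns, 'polygon');
    svg.appendChild(poly);

    function redraw() {
        poly.setAttribute('points', points.map(function (p) {
            return p[0] + ',' + p[1];
        }).join(' '));
    }

    // one draggable handle per vertex
    points.forEach(function (p, i) {
        var handle = document.createElementNS(ns, 'circle');
        handle.setAttribute('r', 5);
        handle.setAttribute('cx', p[0]);
        handle.setAttribute('cy', p[1]);
        svg.appendChild(handle);
        $(handle).on('mousedown', function () {
            $(svg).on('mousemove.drag', function (e) {
                var box = svg.getBoundingClientRect();
                points[i] = [e.clientX - box.left, e.clientY - box.top];
                handle.setAttribute('cx', points[i][0]);
                handle.setAttribute('cy', points[i][1]);
                redraw();
            }).one('mouseup', function () {
                $(svg).off('mousemove.drag');
            });
        });
    });

    redraw();

Dragging a circle moves the corresponding vertex and redraws the outline; adding and removing points would hang off the same structure.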

Tuesday, October 14, 2014

Improved line recognition

I managed to improve line-recognition, which has been limiting the effectiveness of the later stages. Until you can recognise lines in manuscripts – where the lines are often not horizontal, evenly spaced or straight – you have no chance of recognising words reliably. My previous method, which first divided the page into small rectangles, could detect the same line two or three times over – for example, once each for the ascenders, the descenders and the main body of the line. The new method first blurs the image, then subdivides it into narrow vertical strips. Each strip is then reduced to a single pixel in width by averaging it horizontally. This produces a graph of the rise and fall of blackness down the strip. Since the data has many small peaks and troughs that aren't really interesting, I apply a smoothing function before trying to detect the main peaks of blackness, which will very likely correspond to the lines of type or writing. The final step is to join up the lines detected in the strips by aligning them horizontally, as before. The result is very good line-recognition on most of the examples. Here's how the De Roberto manuscript, which is fairly average in difficulty, looks with the lines recognised on top of the blurred image:

A side-effect of this approach is that it should improve word-recognition, not only by helping to locate words, but also by joining up word-fragments through blurring. However, I'm running out of time now as the deadline of November 3 looms.
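
For the curious, here is a rough JavaScript sketch of the strip-and-peak step described above. It is not TILT's actual code: it assumes the page has already been blurred and converted to a greyscale array ('gray', 0 = black, 255 = white), and the strip count and smoothing radius are just plausible values.

    // Sketch of the strip-and-peak idea: split the blurred page into
    // vertical strips, average each strip horizontally into a single
    // column of blackness values, smooth it, and keep the local maxima
    // as candidate line positions.
    function findLinePeaks(gray, width, height, numStrips, smoothRadius) {
        var stripWidth = Math.floor(width / numStrips);
        var peaksPerStrip = [];
        for (var s = 0; s < numStrips; s++) {
            var x0 = s * stripWidth;
            // average each row of the strip into one "blackness" value
            var profile = new Float32Array(height);
            for (var y = 0; y < height; y++) {
                var sum = 0;
                for (var x = x0; x < x0 + stripWidth; x++) {
                    sum += 255 - gray[y * width + x];
                }
                profile[y] = sum / stripWidth;
            }
            // smooth the profile with a simple moving average
            var smooth = new Float32Array(height);
            for (var y2 = 0; y2 < height; y2++) {
                var total = 0, n = 0;
                for (var k = -smoothRadius; k <= smoothRadius; k++) {
                    if (y2 + k >= 0 && y2 + k < height) {
                        total += profile[y2 + k];
                        n++;
                    }
                }
                smooth[y2] = total / n;
            }
            // local maxima of blackness are candidate line centres;
            // weak peaks would still need filtering in practice
            var peaks = [];
            for (var y3 = 1; y3 < height - 1; y3++) {
                if (smooth[y3] > smooth[y3 - 1] && smooth[y3] >= smooth[y3 + 1]) {
                    peaks.push(y3);
                }
            }
            peaksPerStrip.push(peaks);
        }
        return peaksPerStrip;  // joining peaks across strips is a separate step
    }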

Monday, September 15, 2014

Defining sub-regions of images

One of the things that often happens in manuscripts is that certain areas bear writing while the rest of the image can be more or less ignored. In some files I was sent recently by a group interested in TILT, most of the page, handwritten on the verso, is taken up by writing that shows through from the recto. Other cases include decorative borders or letterheads that ideally need to be excluded. Doubtless there are automatic methods to deal with some of these. The question is, how often do they go wrong? And aren't they designed to work on the regularly laid-out pages of printed books? What about manuscripts where the writing goes right into the margin? In such cases any automatic technique is likely to do more damage than it repairs. Manually specifying these active regions is worth the small effort it costs if the result is much improved word-recognition and text-to-image linking. All that was needed was to 'white out' the regions of no interest, so that the same procedure as before would ignore those areas.
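
As a sketch of what 'whiting out' might look like on a canvas, something along these lines would do. The rectangle format and canvas context are assumptions for the example, not TILT's real interface.

    // Paint white over everything outside the user's active regions so
    // that later stages ignore it. "activeRects" is a list of
    // {x, y, w, h} rectangles in image coordinates (an assumption).
    function whiteOutOutside(ctx, width, height, activeRects) {
        var img = ctx.getImageData(0, 0, width, height);
        var data = img.data;
        for (var y = 0; y < height; y++) {
            for (var x = 0; x < width; x++) {
                var inside = activeRects.some(function (r) {
                    return x >= r.x && x < r.x + r.w && y >= r.y && y < r.y + r.h;
                });
                if (!inside) {
                    var i = (y * width + x) * 4;
                    data[i] = data[i + 1] = data[i + 2] = 255;  // set to white
                }
            }
        }
        ctx.putImageData(img, 0, 0);
    }

    // e.g. keep only the right-hand half of a 1000 x 1400 scan:
    // whiteOutOutside(canvas.getContext('2d'), 1000, 1400,
    //                 [{x: 500, y: 0, w: 500, h: 1400}]);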

A case in point was my Harpur typescript, pasted into a scrapbook. Word recognition worked poorly because of interference between the text and residual elements on the border. Now it is as good as the best cases. But this does mean that the process is not entirely automatic.

Friday, August 1, 2014

Woo-hoo! It works!

I always like to capture the joyful moment when something finally 'works'. It makes all the labour seem worthwhile, even if, to an untrained eye, it would still appear to suffer from many deficiencies. So to cut to the chase: the demo site now links the page image one word at a time to the text on the right, and vice versa. Just select an example and click on "upload", then "link". The best example is the Harpur Sonnets. There are problems with splitting shapes that belong to several words (try splitting a polygon some time), and there are many other deficiencies: for example, I don't like the shape of the polygons – they're ugly convex ones, and I want concave ones that surround the word elegantly. And I am painfully aware that my word- and line-recognition modules still need some work. But all these improvements and others can be comfortably consigned to 'future work'.

The total development time from start to this point has been around six and a half weeks of part-time work for one programmer. Compared with other projects that have worked on the same problem, didn't get as far, and cost a great deal more, that's pretty damn fast.

Addendum: There is now a better version but it is much slower. The problem is that words get recognised on the wrong lines. Once that is fixed it should work OK on all the examples. But I won't upload a new version until I'm happy with the speed.

Wednesday, July 9, 2014

Getting word-spacing right

As mentioned in the previous post, the hardest thing to get right in TILT is accurately estimating the minimum space between words. A little reflection will show that manuscripts, typescripts and printed texts all follow very different conventions for the use of spaces between words. How can you estimate word-spaces in manuscripts? It's hopeless, surely?

In fact, there is a trivial solution. A page-image is made up of 'blobs', that is, groups of pixels that are joined together. Wrapping each such blob in a polygon allows you to compute the distance between blobs on a line. In a printed text there will be roughly one blob per letter. In a manuscript, because of joined-up writing, there will be many characters per blob. And every now and again there will be a gap between blobs that is not a word-space. So how can these incidental gaps be distinguished from real word-spaces? Another problem is that there are column-gaps where the space between words is measured in hundreds of pixels, so just measuring gaps in a line and averaging the result has no hope of success. How can these huge gaps be excluded? But then I realised that the number of words on a page is already known, because TILT needs the text of each page in order to align it with the image. So all I had to do was find all the gaps on a page and sort them by decreasing size. By assigning one gap to each word in the text, the last (and smallest) gap chosen gives the width of the minimum word-gap.
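
In code the trick is only a few lines. This is a sketch under the assumptions that 'gaps' is the list of pixel widths of all the gaps between adjacent blobs on every line of the page, and 'wordCount' is the number of words in the page's transcription; it is not the actual TILT implementation.

    // Sort the gaps from largest to smallest and hand one out to each
    // word; the last (smallest) gap handed out is the estimate of the
    // minimum word-gap for this page.
    function minimumWordGap(gaps, wordCount) {
        var sorted = gaps.slice().sort(function (a, b) { return b - a; });
        var index = Math.min(wordCount, sorted.length) - 1;
        return index >= 0 ? sorted[index] : 0;
    }

    // e.g. minimumWordGap([40, 3, 28, 5, 31, 2, 26], 4) returns 26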

This works so well that I have updated the test script to show it. It will work for any page from a manuscript, typescript, inscription or printed book. The only gotcha is that you must know the number of words, or use an estimate. Also, it can never be perfect: there is no single minimum word-gap setting, since an author can write two words with less than that separation, or split one word with more. But it's still about as good as an automatic estimate can be.

Just to give you some idea of how much the minimum word-gap varies between the test examples:

Type          Author        Number of words    Minimum word-space (pixels)
Typescript    Harpur        150                4
Printed       De Roberto    291                12
Manuscript    De Roberto    353                6
Manuscript    Capuana       205                7

Now for the text-to-word alignment. That's the last stage.

Sunday, July 6, 2014

TILT recognises words in manuscripts, typescripts, books

The next milestone has been reached. TILT can now recognise words on the lines identified earlier with reasonable accuracy. What it does is pretty simple: it looks for black text in a strip on either side of each line identified in the previous step, extends any black shapes it discovers outwards, and finally draws a polygon around each discovered word.
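
Here is a rough sketch of that idea in JavaScript, under the assumption that 'gray' is the page as a greyscale array (0 = black, 255 = white) and 'lineY' the vertical position of one detected line; the band height, the black threshold and the use of rectangular spans are simplifications of what is described above.

    // Scan a horizontal band around a detected line, note which columns
    // contain ink, and merge runs of ink that are closer together than
    // the minimum word-gap. Returns one {x0, x1} span per candidate word.
    function findWordSpans(gray, width, height, lineY, bandHeight, minWordGap) {
        var columnHasInk = new Array(width).fill(false);
        var yTop = Math.max(0, lineY - bandHeight);
        var yBot = Math.min(height - 1, lineY + bandHeight);
        for (var y = yTop; y <= yBot; y++) {
            for (var x = 0; x < width; x++) {
                if (gray[y * width + x] < 128) {   // "black enough"
                    columnHasInk[x] = true;
                }
            }
        }
        var spans = [];
        var start = -1, lastInk = -1;
        for (var x2 = 0; x2 < width; x2++) {
            if (columnHasInk[x2]) {
                if (start < 0) {
                    start = x2;
                } else if (x2 - lastInk > minWordGap) {
                    spans.push({x0: start, x1: lastInk});  // big gap: new word
                    start = x2;
                }
                lastInk = x2;
            }
        }
        if (start >= 0) spans.push({x0: start, x1: lastInk});
        return spans;
    }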

The next step will be to refine these words so they represent as closely as possible the words in the transcribed text. Then it should be easy to align them with the words in the transcription of the page (which we often already have) and hand over to Anna for development of the GUI. Here's a sample of TILT's current performance using polygons. These can be reduced to rectangles easily if desired.

The main problem in recognising words turned out to be the different ways that spaces are used in printed and handwritten texts. In the former there are lots of little inter-character gaps that mostly aren't present in manuscripts. Try as I might, I couldn't find a single setting that worked well for both. These images show the performance on a typescript, a manuscript and a printed page. The colours alternate to show where word-divisions have been recognised. To get this performance in practice, the GUI will have to specify the image type.

The next stage has some ability to split and merge words, based on their alignment with a known text, but it would be better if good word identification could be achieved already at this stage.

Friday, June 27, 2014

TILT recognises lines in manuscript/print books

Perhaps the hardest thing to get right in the TILT design is reliably recognising lines on a page where the division into lines may be irregular. For example, you can easily have uneven line-spacing, or inserted, warped and tilted lines, but in order to recognise the words on a page you first have to work out roughly where they are. TILT has shown, early in its life, that the basic idea for its line-recognition method works. There is a live demo here. It is slow only because the server is slow; TILT itself is actually pretty fast.

Once you've loaded a page you can click on some buttons to see how TILT processes a page-image. First it reduces it to greyscale, then to pure black and white, then it removes residual borders (which are ordinary OCR steps). Finally it searches the page for lines, using a grid of rectangles, about 25 across and 200 down. The reason for this strange proportion is that lines are pretty much shaped that way. So if lines slant down or up, or curve, it should be able to track their progress across the page. So far it has demonstrated that it can discover small lines in between the main ones. Ordinary OCR programs can't do this: they assume that text has evenly-spaced lines. TILT's test interface draws a line over the top of each line of text just for the demonstration; in the real product these lines will be invisible. Along each line it will later attempt to recognise words, and to align those words with the already transcribed text. But this step brings that goal much closer.
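
To give a flavour of the grid step, here is a small sketch: divide the black-and-white page into cells (the 25 x 200 proportion comes from this post; everything else is an assumption) and record how much ink each cell contains, so that inky cells can later be chained left to right into candidate lines.

    // Compute the fraction of black pixels in each cell of a cols x rows
    // grid over a black-and-white page ("bw", 0 = black, 255 = white).
    function blackGrid(bw, width, height, cols, rows) {
        var cellW = width / cols, cellH = height / rows;
        var grid = [];
        for (var r = 0; r < rows; r++) {
            var row = new Float32Array(cols);
            for (var c = 0; c < cols; c++) {
                var ink = 0, total = 0;
                for (var y = Math.floor(r * cellH); y < Math.floor((r + 1) * cellH); y++) {
                    for (var x = Math.floor(c * cellW); x < Math.floor((c + 1) * cellW); x++) {
                        if (bw[y * width + x] === 0) ink++;
                        total++;
                    }
                }
                row[c] = total > 0 ? ink / total : 0;
            }
            grid.push(row);
        }
        return grid;  // e.g. blackGrid(pixels, 1000, 1400, 25, 200)
    }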

What this makes possible is the offline processing of large numbers of page-images, creating page-image-to-text links which can then be uploaded. They won't be perfect without editing (which is what the graphical user interface is needed for), but as a first pass that will suffice.

Wednesday, June 11, 2014

TILT is born again

This is a fresh start for the text-to-image linking tool (TILT). TILT is a tool for linking areas on a page-image taken from an old book, be it manuscript or print, to a clear transcription of its contents. As we rely more and more on the Web, there is a danger that we will leave behind the great achievements of our ancestors, recorded in written form over the past 4,000 years. What happens on the Web to all those printed books and handwritten manuscripts on paper, vellum, papyrus, stone or clay tablets? Can we only see and study them by actually visiting a library or museum? Or is there some way that they can come to us, so they can be properly searched, studied, commented on and examined by anyone with a computer and an Internet link?

So how do we go about that? Haven't Google and others already put lots of old books onto the Web by scanning images of pages and reading their contents using OCR (optical character recognition)? Sure they have, and I don't mean to play down the significance of that, but for objects of greater than usual interest you need a lot more than mere page-images and unchecked OCR of their contents. For a start you can't OCR manuscripts, or at least not well enough. And OCR of even old printed books produces lots of errors. Laying the text directly on top of the page-images means that you can't see the transcription to verify its accuracy. Although you can search it, you can't comment on it, format it or edit it. And in an electronic world, where we expect so much more of a Web page than for it merely to sit there dumbly to be stared at, the first step in making the content more useful and interactive is to separate the transcription from the page-images.

Page-image and content side by side

Page images are useful because they show the true nature of the original artefact. Not so for transcriptions: these are composed of mere symbols that, by convention, were chosen to represent the contents of the writing. You can't use plain text on a line to represent complex mathematical formulae, drawings or wood-cuts, the typography, the layout, or the underlying medium. So you still need an image of the original to provide this supplementary information, not least because you might want to verify that the transcription is a true representation of it. The only practical way to offer both is to put the transcription next to the image.

Now the problems start. One of the principles of HCI (human-computer interaction) design is that you have to minimise the effort, or ‘excise’, as the user goes about his or her tasks. And putting the text next to the image creates a host of problems that increase excise dramatically.

As the user scrolls down the transcription, reading it, at some point the page-image will need refreshing. Likewise, if the user moves on to another page image, the transcription will have to scroll to match. So some linkage between the two is already needed even at the page level of granularity.

And if the text is reformatted for the screen, perhaps on a small device like a tablet or a mobile phone, the line-breaks will be different from the original. So even if the printed text is perfectly clear, it won't be clear, as you read the transcription, where the corresponding part of the image is. You may say that this is easily solved by enforcing line-breaks exactly as they are in the original. But if you do that, and the lines don't fit in the available width – and remember that half the screen is already taken up with the page-image – then the end of each enforced line must wrap around onto the next line, or else it will disappear off to the right. Either way it is pretty ugly and not at all readable. Consider also that the line height, or distance between lines, in the transcription can never match that of the page-image. So at best you'll struggle to align even one line at a time in the two halves of the display.

So what's the answer? It is, as several others have already pointed out (e.g. TILE, TBLE, EPT), to link the transcription to the page-image at the word level. As the user moves the mouse over, or taps on, a word in the image or in the transcription, the corresponding word can be highlighted in the other half of the display, even when the word is split over a line-break. And if needed, the transcription can be scrolled up or down so that it automatically aligns with the word on the page. Now the ‘excise’ drops back to a low level.
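
As a sketch of how little client-side code the word-level linkage itself needs, suppose each word in the transcription and its outline over the image share a data-word attribute (e.g. a <span data-word="w17"> and a matching SVG polygon); the attribute name and class are invented for the example, not TILT's real markup. With jQuery the cross-highlighting is then roughly:

    // Highlight the partner(s) of whatever word the mouse is over,
    // whether the word sits in the transcription or over the image.
    $('[data-word]').on('mouseenter mouseleave', function (e) {
        var id = $(this).attr('data-word');
        $('[data-word="' + id + '"]').each(function () {
            // classList works on HTML spans and SVG shapes alike
            if (e.type === 'mouseenter') this.classList.add('linked-highlight');
            else this.classList.remove('linked-highlight');
        });
    });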

Making it practical

The technology already exists to make these links, but the problem is how to create them. Creating them by hand is incredibly time-consuming and also very dull work, so automation is the key to making it work in practice. The idea of TILT is to make this task as easy and fast as possible, so that we can create hundreds or thousands of such text-to-image linked pages at low cost, and make all this material truly accessible and usable. The old TILT was written at great speed for a conference in 2013. What it did well was outline how the process could be automated, but it had a number of drawbacks that can, now that they are understood properly, be remedied in the next version. So this blog is to be a record of our attempts to make TILT into a practical tool. The British Library Labs ran a competition recently and we were one of the two winners. They are providing us with support, materials and some publicity for the project. We aim to have TILT finished in a demonstrable and usable form by October 2014.