AI dilemmas

Our February 2026 newsletter featured the “Local Images” website built for Manchester Central Library (MCL), which we launched about a year ago but recently enhanced with metadata derived using what is often vaguely referred to as “AI” (including in our newsletter). Although the whole newsletter was devoted to the topic, in the interest of brevity we focused mainly on the accessibility and discoverability improvements that the work had enabled, and we barely touched on issues such as technology, ethics or validation. These are super-hot and interesting topics, though, as evidenced by the fact that we soon had an email from Tim Burge, who we have worked with previously and whose curiosity about the topic comes as no surprise! Tim was happy to see a positive example of how AI can genuinely be useful in enriching museum datasets, but he had some questions too. Over to him:

I wonder how many false positives get thrown in? I’m intrigued how this pic for example got tagged as ‘baby’. Did AI think the handbag was a baby!!!!!???

Was there any sort of internal staff checking?

These are very good questions, and the answers in this scenario might well be inappropriate for another. We’d certainly been aware of some imperfections in the responses from the Azure Vision service, so did we investigate how frequent they were, did we consider them to be a concern, and why?

False positives

There are certainly false positives in the labels. I’ve seen oddities like the “baby” example myself, including with similar terms such as “toddler”. Unfortunately the Azure response for the image labelling API provides only the label and a probability score, so we can’t tell which part of an image prompted a given label, whereas the OCR response (looking for text in the image) also includes coordinates for all of the text elements it has found. That’s probably because “labels” are not just the objects that it sees, but topics/themes that pertain to the whole image, so providing a location isn’t appropriate.
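To make the "label plus score" point concrete, here is a minimal sketch of how labels could be separated by score. It is not what the site does: the field names `name` and `confidence`, the 0.8 cut-off and the example values are all assumptions for illustration, not real output for any image.

```python
def split_by_confidence(labels, threshold=0.8):
    """Split labels into those at or above the threshold and those below it."""
    kept = [label for label in labels if label["confidence"] >= threshold]
    doubtful = [label for label in labels if label["confidence"] < threshold]
    return kept, doubtful


# Invented response for illustration only. Note that there are no
# coordinates, so nothing says which part of the image suggested "baby".
example_labels = [
    {"name": "person", "confidence": 0.98},
    {"name": "street", "confidence": 0.91},
    {"name": "baby", "confidence": 0.55},
]

kept, doubtful = split_by_confidence(example_labels, threshold=0.8)
print("kept:", [label["name"] for label in kept])
print("doubtful:", [label["name"] for label in doubtful])
```

A cut-off like this trades false positives for missed tags, and where to put it depends on how much a wrong tag matters in the context.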

How many are there? I really don’t know. This is a difficult thing to test manually, and who would trust a computer to mark its own work? It would be possible, of course, to do some sort of sampling and come up with an estimate for the percentage of iffy or plain wrong tags, but for the reasons I’ll come to later it wasn’t even considered as something worth doing.
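For the curious, the sampling approach mentioned above could look something like the sketch below. We did not do this. The function names, the sample size and the counts in the usage lines are hypothetical. A person would check the tags on a random sample of records, and a Wilson score interval would give a rough range for the true error rate.

```python
import math
import random


def draw_sample(record_ids, sample_size, seed=0):
    """Pick records for manual checking, reproducibly and without repeats."""
    rng = random.Random(seed)
    return rng.sample(record_ids, sample_size)


def wilson_interval(wrong, checked, z=1.96):
    """Approximate 95% Wilson score interval for the share of wrong tags."""
    if checked <= 0:
        raise ValueError("checked must be positive")
    p = wrong / checked
    denom = 1 + z**2 / checked
    centre = (p + z**2 / (2 * checked)) / denom
    half = z * math.sqrt(p * (1 - p) / checked + z**2 / (4 * checked**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical numbers: 200 records sampled from 100,000, and a reviewer
# judging 30 of the 400 tags on those records to be wrong.
sample = draw_sample(list(range(100_000)), 200)
low, high = wilson_interval(wrong=30, checked=400)
print(f"estimated share of wrong tags: between {low:.1%} and {high:.1%}")
```

Tags on the same image are not independent of each other, so an interval like this would be indicative at best.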

QA/testing

So what quality assurance did we (TMP and the Library) actually undertake? The only real testing was at the start, when we tried a few services (from Google, IBM and Microsoft) on a sample of a dozen or so varied images and evaluated the results for captioning, labelling and OCR. We agreed with MCL which was the best and rolled with that.

When the data enrichments were in place, and we’d done some user interface work to surface them, the Library were very enthusiastic to try it out, but they weren’t really in a position to test in depth following the departure of the project officer who had shepherded the website build. On the TMP side we already knew that there would be errors in all three jobs we got Azure Vision to do (tagging, OCR and captioning for ALT text), but the only one we were really concerned about was the captioning, where we sampled in greater depth. The key question was…

…what actually mattered (in this context)

Our collective feeling was that the important thing was to improve discoverability and usability, NOT to pretend that this was documentation¹. The enrichments were there to help people to find their way to the things that they are interested in – only in the case of the captions did we consider that users would need to be able to trust them to actually tell them about the image in question, once they had found it.

Records on the site, including the example that Tim found, are generally thinly documented, which was in fact a primary motivation for our approach to the site and this piece of work. They may have a creator, production date, identifier, subject, collection name and a title, and sometimes a very brief description that often has nothing to do with the content of the image. All of this makes it really hard to find things with text search, or even with the subjects as filters.

Our goal was to improve that, and it didn’t matter if sometimes you were suggested records that turned out not to be on-point as long as you generally had a better chance of finding what mattered. False positives in this context aren’t a big deal as long as more good stuff floats to the top, but it does depend on which enrichment we’re talking about.

The OCR – finding text in the images – is like magic when it comes to surfacing hidden treasures based on aspects that are important to people but were less salient to the official photographers and archivists who documented them for wholly different purposes. The OCR is full of errors, both where it has missed chunks of text that the human eye can readily see, and where it has inaccurately transcribed other parts. But this is very much outweighed by the fact that you can find things you’d be hard pushed to find otherwise. The vast majority of these records don’t actually mention the Abbey National, for instance, but the images themselves do, and it would be the work of days or weeks to find them amongst 100,000 other images if it wasn’t for the hidden OCR data. But we don’t surface OCR text on the page: it’s too flaky for this, and for most people it is much easier simply to look at the image. It’s also pretty meaningless as it’s presented because of how it is broken down in the API response.
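The "searchable but not shown" idea can be sketched as below. This is a toy illustration and not the site’s actual search: the `Record` class, the identifiers and the example records are invented, and the matching is plain substring search.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    identifier: str
    title: str
    ocr_text: str = ""  # searched, but never shown on the page
    ai_tags: list = field(default_factory=list)


def searchable_text(record):
    """Everything the keyword search looks at, including the hidden OCR text."""
    return " ".join([record.title, record.ocr_text, *record.ai_tags]).lower()


def search(records, query):
    """Return records whose searchable text contains every query term."""
    terms = query.lower().split()
    return [r for r in records if all(t in searchable_text(r) for t in terms)]


def display_fields(record):
    """What the record page shows: the OCR text is deliberately left out."""
    return {"identifier": record.identifier, "title": record.title,
            "ai_tags": record.ai_tags}


# Invented records for illustration only.
records = [
    Record("m001", "Street scene", ocr_text="ABBEY NATIONAL BUILDING SOCIETY"),
    Record("m002", "Drawing of a church", ai_tags=["church", "drawing"]),
]

hits = search(records, "abbey national")
print([display_fields(r) for r in hits])
```

The first record is found by words that appear only in the image, yet nothing from the flaky OCR transcription reaches the page.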

Tags also help with finding things by attributes that may not have been catalogued, but unlike the OCR text we decided to surface them on the record pages and in the search itself. It’s a good question whether or not we should do this, though. We could have left those tags off the page and out of the search filters, and they would still have helped the keyword search, but we decided to make them visible (labelled as “AI generated”) because they are structured in such a way that they work well as filters. They also use terms and identify subjects that we thought may align better with the terms that people want to use and the subjects they find interesting. The AI tags make an interesting comparison alongside the subject tags from the collections management system. To pick an example at random, this drawing of a church has been tagged as “building; drawing; painting; art; outdoor; church”, which are generic and broad, whilst the formal categorisations are much more specific: “Holy Trinity Church; Hulme; Stretford Road”. Each has a very useful role, depending on your intention. The key thing is to be clear about the fact that the AI tags are not an official description of what you can see, but an aid to discovery.
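As a rough sketch of why structured tags work well as filters, the snippet below keeps the catalogue subjects and the AI tags in separate fields, so that each can be offered as its own filter and the AI one can carry its “AI generated” label. The field names, the helper functions and the second record are invented for illustration. Only the tags on the first record come from the church example above.

```python
from collections import Counter

# Display labels for each filter group. The key names are assumptions.
FACET_LABELS = {"subjects": "Subject", "ai_tags": "AI generated"}


def facet_counts(records, key):
    """Count how many records carry each value of the given tag field."""
    counts = Counter()
    for record in records:
        counts.update(set(record.get(key, [])))
    return counts


def filter_by(records, key, value):
    """Keep only the records tagged with the chosen filter value."""
    return [r for r in records if value in r.get(key, [])]


records = [
    {"id": "a",
     "subjects": ["Holy Trinity Church", "Hulme", "Stretford Road"],
     "ai_tags": ["building", "drawing", "painting", "art", "outdoor", "church"]},
    # Invented second record, for illustration only.
    {"id": "b", "subjects": ["Hulme"], "ai_tags": ["building", "outdoor"]},
]

for key, label in FACET_LABELS.items():
    print(label, dict(facet_counts(records, key)))
print([r["id"] for r in filter_by(records, "ai_tags", "church")])
```

Keeping the two sources apart means the AI tags never overwrite or blend into the formal subject terms.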

Captions are a little bit different. Being used as alt text (and aria labels on our zoomable image viewers) they are destined for public consumption. Even if the majority of users are unaware of them, for others they are essential as a direct substitute for the image itself, so they need to be reasonably reliable. Our spot checks indicate that the captions may be vague but they are generally not too far wide of the mark. But let’s look at some examples from the front page:

I mean, it’s not wrong… But is it useful?

It is, in part, wrong. But it’s probably not useless.

Solid.

Overall I think the jury is out on how much value the captions add (or indeed remove). It would certainly be good to conduct some testing with visually impaired users to establish whether or not this is, indeed, better than nothing!

Whether it matters in other contexts

We would absolutely not use any of the AI outputs for documentation¹; that is, to make confident assertions of fact, or at least of authoritative interpretation. That’s not to say that the outputs couldn’t be held within a collections management system (KEmu, in the case of MCL), but if so then they should be flagged up as being generated and used to support findability or even for provocations, rather than to consult as accurate descriptions of something that a human should really be assessing with their own eyes and domain knowledge.


Many thanks to Tim for his prompt. Like all good prompts it has led to a large amount of hard-to-verify text from some faceless entity on the far side of your screens 😊 Let us know if you have questions yourself about AI or any other aspect of our projects, the gnarlier the better!

  1. By documentation we mean the formal records of museum collections, probably in a collections management system, which need to be as accurate and precise as possible.