Content by the numbers and the next steps

Posted in: Beta, Content design, Content Publisher

In March, the Digital team set out on an ambitious project to inventory our entire website.

Our purpose was to learn more about the content we create and the publishers who write it. Gathering this knowledge with a thorough inventory process is something that I have wanted to do ever since I joined Bath in 2011.

This is what we found, how we found it and what our findings mean for how we plan, govern and build better content.

What we inventoried

Inventory of content

We found 559,263 content assets live on the website. That’s a lot of files.

Number of files by file type

The three largest groups of files are images (214,261), HTML pages (160,398) and unknown (98,586). Unknown files are assets that could not be categorised by the inventory tool based on their file extension. There are also 12,227 audio files and 1,045 videos.

Number of files by location

91% of these assets are held in our legacy publishing system, Dreamweaver. Only 50,334 have been published using our current content management system, OpenCms.
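
The split between the two systems can be checked from the totals quoted above. This is a minimal sketch that uses only the figures in this post; the variable and function names are illustrative, and it assumes every asset sits in exactly one of the two systems.

```python
# Figures quoted in this post.
TOTAL_ASSETS = 559_263
OPENCMS_ASSETS = 50_334


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


opencms_share = share(OPENCMS_ASSETS, TOTAL_ASSETS)
# Assumption: anything not published through OpenCms is in the legacy system.
legacy_share = 100 - opencms_share

print(f"OpenCms: {opencms_share:.0f}%, legacy: {legacy_share:.0f}%")
# prints "OpenCms: 9%, legacy: 91%"
```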

84% of the HTML pages have not been updated in the last 12 months.

Inventory of publishers

We also analysed log-ins to OpenCms. 348 authors are currently trained to use the system, but only 29 publishers regularly log in. We defined ‘regularly’ as averaging once a day over six months.

During the same six months, 262 people either did not log into the CMS at all or used it fewer than 30 times. That is just over 75% of trained publishers.
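
The thresholds above could be applied to raw log-in counts along these lines. This is a rough sketch, not the analysis we actually ran: it assumes six months means 182 calendar days, the ‘occasional’ band is a made-up label for everyone between the two thresholds, and the sample counts are hypothetical.

```python
PERIOD_DAYS = 182          # assumption: six months counted in calendar days
INFREQUENT_THRESHOLD = 30  # from the post: fewer than 30 log-ins in the period


def classify(login_count: int, period_days: int = PERIOD_DAYS) -> str:
    """Place a publisher in a usage band based on log-ins over the period."""
    if login_count < INFREQUENT_THRESHOLD:
        return "infrequent"  # includes people who never logged in
    if login_count / period_days >= 1:
        return "regular"     # averages at least one log-in a day
    return "occasional"      # hypothetical label for everyone in between


# Hypothetical log-in counts, not real data.
sample = {"publisher_a": 0, "publisher_b": 45, "publisher_c": 200}
bands = {name: classify(count) for name, count in sample.items()}
print(bands)

# Share of infrequent publishers, using the figures quoted in the post.
infrequent_share = 100 * 262 / 348
print(f"{infrequent_share:.1f}%")  # prints "75.3%"
```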

How we inventoried

Building a tool

In the past, we compiled inventories manually. From the start, it was clear that this wouldn’t be an option if we wanted to inventory the entire site, including every single section and page, consistently in less than a month.

Content inventory tool

The biggest challenge was that many of our assets are either not linked to from anywhere or are password-protected, so third-party crawler tools like Content Analysis Tool (CAT) wouldn’t have been suitable. These tools work like search engines: they inventory only the assets they find by following links in HTML documents.
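
To illustrate why link-following misses unlinked assets, here is a toy crawler run against a made-up site map. The paths and the `crawl` function are hypothetical and this is not how CAT is implemented; it only shows the general breadth-first approach and its blind spot.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/index.html": ["/about.html", "/courses.html"],
    "/about.html": ["/index.html"],
    "/courses.html": [],
    "/old/prospectus-2009.html": ["/index.html"],  # nothing links to this page
}


def crawl(start: str, links: dict[str, list[str]]) -> set[str]:
    """Return every page reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


found = crawl("/index.html", LINKS)
# Pages that exist on the server but were never reached by the crawler.
missed = sorted(set(LINKS) - found)
print(missed)  # prints ['/old/prospectus-2009.html']
```

Listing files directly from the CMS and file store avoids this problem because it does not depend on links at all.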

We decided instead to iterate on an existing PHP application. Our inventory tool works by querying the CMS, file store and Google Analytics API. It then processes the data before outputting the information as an HTML table.
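
Our tool is written in PHP, but the general shape of the process can be sketched in a few lines of Python. Everything here is illustrative: the three sources are replaced with hypothetical in-memory data, and the field names and the `build_rows` and `to_html_table` functions are assumptions rather than the tool’s real code.

```python
from html import escape

# Hypothetical stand-ins for the three sources. The real tool queries
# the CMS, the file store and the Google Analytics API.
cms_assets = [{"path": "/study/index.html", "title": "Study"}]
file_store_assets = [{"path": "/images/campus.jpg"}]
analytics = {"/study/index.html": {"unique_page_views": 1200}}


def build_rows(cms, files, stats):
    """Merge assets from both stores and attach analytics where available."""
    rows = []
    for source, assets in (("CMS", cms), ("File store", files)):
        for asset in assets:
            path = asset["path"]
            rows.append({
                "name": path.rsplit("/", 1)[-1],
                "location": source,
                "title": asset.get("title", ""),
                "unique_page_views": stats.get(path, {}).get("unique_page_views", ""),
            })
    return rows


def to_html_table(rows):
    """Render the merged rows as a simple HTML table."""
    if not rows:
        return "<table></table>"
    headers = list(rows[0])
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[h]))}</td>" for h in headers) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


rows = build_rows(cms_assets, file_store_assets, analytics)
print(to_html_table(rows))
```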

Content inventory tool output

The tool reports on the asset name, location and type. For HTML files, it also reports on the title, H1, number of unique page views, time on page and bounce rates.

This process standardises the output, making the data easier to analyse. For example, rather than returning a list of file types, the tool organises them into categories like ‘images’. A single inventory, which used to take days to complete, now takes just minutes.
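
Grouping file types into categories could look something like the sketch below. The extension lists are illustrative guesses rather than the tool’s actual mapping, and the sample paths are made up; anything without a recognised extension falls into ‘unknown’.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative mapping only; the real tool's category lists may differ.
CATEGORIES = {
    "images": {".jpg", ".jpeg", ".png", ".gif"},
    "HTML pages": {".html", ".htm"},
    "audio": {".mp3", ".wav"},
    "video": {".mp4", ".mov"},
}


def categorise(path: str) -> str:
    """Return the category for a file path based on its extension."""
    extension = PurePosixPath(path).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if extension in extensions:
            return category
    return "unknown"  # no extension, or one not in the mapping


# Hypothetical sample paths.
paths = [
    "/images/campus.JPG",
    "/study/index.html",
    "/podcasts/lecture.mp3",
    "/docs/README",
    "/data/export.xyz",
]
counts = Counter(categorise(p) for p in paths)
print(counts)
```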

Inventorying the website

In March, we began inventorying based on a list of top-level folders drawn from OpenCms and the file store.

For over a month, we worked in pairs to inventory and verify the data gathered from over 300 individual sections.

We then identified the faculty or service responsible for each section and shared the inventory with the Lead Publisher to prepare for auditing.

Analysing the data


In April, our Business Analyst, Takashi Yonenaga, aggregated the inventory results to create a series of graphics which visualised the ‘big data’ we had collected.

For the first time ever, we were able to understand the composition and age of our content, and how it could be broken down by location, organisation and type.

This gave us an overview of the whole site and also provided us with folder-specific information.

Auditing the inventories

Over the past few weeks, we’ve worked with Lead Publishers from across the University to audit their digital content using the inventories we created.

In each inventory, the publishers are working out which content items they want to archive, which they want to transition to the new CMS, which of the new content templates they will use and what work needs to be done to the content to get it fit for the move.

The information in the inventory provides a data-rich guide to what content they have and what content is needed. As of this morning there are 90 inventories left to audit out of a total of 311.

Lessons learned

The data we have gathered is invaluable. It is not only helping publishers, but also enabling the University to make better decisions about how we plan, manage and create content in future. The plan is to integrate the inventory tool directly into the new publishing application and regularly inventory and audit our content.

Ultimately, Bath is no different from any large publisher whose website has developed organically over time through devolved publishing rights - we all have a lot of baggage! The difference is we now know how much we’re carrying, and can work with publishers to decide what we should bring with us to the new site.
