OA Monitoring: why do we get different results?
Leiden University is 81st worldwide for open access and 1st in the Netherlands according to the CWTS Leiden Ranking 2020. But why do different ways of monitoring OA give different results?
This week the CWTS Leiden Ranking 2020 (for publication years 2015-2018) was published. Leiden University (including the Leiden University Medical Centre, LUMC) ranks 81st worldwide (out of 1176 institutions) and first nationally (out of 13 universities), with 70.5% open access (OA) (table 1). This is a growth of more than 10% with respect to the previous year (see also table 1).
The numbers produced by the Leiden Ranking are likely to differ from the results of the 2019 Dutch national monitoring report, which is due to be presented to the Minister of Education, Culture and Science and published on the website of the Association of Universities in the Netherlands (VSNU) within the coming weeks. In this report we expect open access at Leiden in 2019 to account for around 60% of our publications.
Leiden University Libraries regularly shares its own OA results through its website and with the faculties. For example, during OA week last year a news item showed that over 70% of our publications for 2017 and 2018 were already open access, and that reaching 75% open access would therefore be a reasonable goal for 2019.
Why are all these numbers so different?
Table 1: CWTS Leiden Ranking for Leiden University in 2019 and 2020, type of indicator = open access, sort order: PP(OA)
(2019 ranking downloaded on 4th July 2020; 2020 ranking downloaded on 8th July 2020)
Type / Rank | International Rank (N=963) | Netherlands Rank (N=13) | Netherlands Rank (N=13)
The differing percentages of OA can be explained by several factors: different stakeholders use different definitions of OA, different data sources, and different inclusion and exclusion criteria. But the precise nature of these differences is not always obvious to the casual reader.
In the next paragraphs we will look into the reports produced by three different monitors of institutional OA, namely the CWTS Leiden Ranking, the national monitoring in the Netherlands, and Leiden University Libraries' own monitoring.
The EU Open Science Monitor also monitors trends for open access to publications, but because it does so only at a country level and not at the level of individual institutions, we have not included it in our comparison. The EU Monitor’s methodological note (including the annexes) does, however, explain their choice of sources.
We will end this blog post with a conclusion and our principles and recommendations.
CWTS Leiden Ranking
Every year CWTS (the Centre for Science and Technology Studies at Leiden University) publishes the Leiden Ranking, which "offers important insights into the scientific performance of over 1000 major universities worldwide" and includes OA as one of the indicators. By default the percentages are measured over a period of four years.
A strength of the CWTS report is that the same method and the same data source are used for universities all over the world.
CWTS enriches and makes use of the high-quality Web of Science data in combination with the Directory of Open Access Journals (DOAJ) and Unpaywall. Using Web of Science does, however, mean that many disciplines (such as humanities and law) are under-represented in the monitoring.
Although the definitions and methods are transparent and published on the website (see also Robinson-Garcia et al. (2020, March 17) "Open Access uptake by universities worldwide", accepted for publication in PeerJ), the data is not shared, presumably because it is proprietary and cannot be shared. As the data is unavailable, it is impossible to check which publications have been analysed for the monitoring of the different universities. In that sense the results of the CWTS ranking are neither completely transparent nor completely reproducible.
National monitoring in The Netherlands
All thirteen Dutch universities report their OA numbers to the VSNU on an annual basis and monitor according to the same established VSNU definitions. This "Definition framework monitoring Open Access" specifies the categories to be counted for the national OA monitoring. Universities are allowed to specify their own methods for delivering the numbers according to these categories.
In terms of data sources, all universities use the data from their local Current Research Information Systems (CRIS). Currently five universities extract the figures by using a manual written by Delft University and two use an R-script originally written by Utrecht University. However, some universities, such as Leiden University, use their own method.
Each university can choose to collect the data at any time during the first six months of the year following the reference year, and the dates that different universities have previously chosen have varied from February until June.
This makes a difference because the later the date of counting, the more OA publications you will find. The visibility and linking of OA versions lags behind the addition of the publication metadata to the CRIS, often due to embargoes that prevent OA versions being made available until six months or more after publication.
Only one university excludes files from its count that were not OA during the relevant year of measurement (so papers that became OA after 1 January 2019 would not be included in the count for 2018 even if they were OA at the time of the analysis during 2019). All other universities in the Netherlands include anything that is OA on the day of their analysis and match it to the year of publication. Needless to say, these different choices lead to considerable differences in outcome. Universities that perform their analysis early will miss OA publications that come in those first months, and others who wait as late as possible will profit from these additions and achieve higher percentages of OA as a result.
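The effect of these two choices (the date of counting, and whether OA status must have been reached within the reference year) can be illustrated with a minimal sketch. The four records, the function name `oa_percentage` and the `cutoff_at_year_end` switch are all hypothetical and only serve to show the mechanism; they do not represent any university's actual method or data.

```python
from datetime import date

# Hypothetical records: (publication year, date the OA version became available, or None)
PUBLICATIONS = [
    (2018, date(2018, 3, 1)),   # OA within the publication year
    (2018, date(2019, 2, 15)),  # embargo ended early in the following year
    (2018, date(2019, 5, 20)),  # embargo ended later in the following year
    (2018, None),               # never made OA
]

def oa_percentage(publications, year, counting_date, cutoff_at_year_end=False):
    """Percentage of publications from `year` counted as OA on `counting_date`.

    With cutoff_at_year_end=True, only versions that were already OA by
    31 December of the publication year are counted (assumed rule).
    """
    in_year = [oa for (pub_year, oa) in publications if pub_year == year]
    if not in_year:
        return 0.0
    limit = counting_date
    if cutoff_at_year_end:
        limit = min(date(year, 12, 31), counting_date)
    n_open = sum(1 for oa in in_year if oa is not None and oa <= limit)
    return 100.0 * n_open / len(in_year)

early = oa_percentage(PUBLICATIONS, 2018, date(2019, 2, 1))
late = oa_percentage(PUBLICATIONS, 2018, date(2019, 6, 30))
strict = oa_percentage(PUBLICATIONS, 2018, date(2019, 6, 30), cutoff_at_year_end=True)
print(early, late, strict)
```

In this toy example the same four publications yield a different OA percentage depending only on when and how the count is made, which is why the date and the rule need to be reported alongside the number.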
Numbers are reported per university and are not de-duplicated on a national level, meaning that collaborative publications between Dutch universities are counted at each participating university separately. If de-duplication took place, it would be possible to combine the figures from all Dutch universities and easily arrive at a figure that represents the whole country.
Most open access experts at Dutch universities are aware of these differences, but for others the story behind the data remains untold, and the differences will therefore not always be known to the data users. It is therefore especially important when monitoring OA to define what has been measured, which sources have been used, which methods have been applied, and when the analysis took place. To solve the problems caused by these variations, a uniform method is now being developed nationwide.
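A minimal sketch of what de-duplication could look like is given below. It assumes that every publication carries a DOI that can serve as a matching key, and that a publication counts as OA nationally if at least one university reports it as OA; both are assumptions made for illustration, and the university names and DOIs are made up.

```python
# Hypothetical per-university reports: DOI -> True if the publication is OA
REPORTS = {
    "University A": {"10.1000/a1": True, "10.1000/shared": True, "10.1000/a2": False},
    "University B": {"10.1000/SHARED": True, "10.1000/b1": False},
}

def naive_national_total(reports):
    """Add up the per-university counts; co-authored papers are counted once per university."""
    total = sum(len(pubs) for pubs in reports.values())
    n_open = sum(sum(pubs.values()) for pubs in reports.values())
    return n_open, total

def deduplicated_national_total(reports):
    """Count each DOI once; treat it as OA if any university reports it as OA (assumed rule)."""
    merged = {}
    for pubs in reports.values():
        for doi, is_oa in pubs.items():
            key = doi.strip().lower()  # DOIs are case-insensitive, so normalise before matching
            merged[key] = merged.get(key, False) or is_oa
    return sum(merged.values()), len(merged)

print(naive_national_total(REPORTS))         # (OA count, total) with double counting
print(deduplicated_national_total(REPORTS))  # (OA count, total) after de-duplication
```

Publications without a DOI would need a different matching strategy (for example on title and year), which this sketch does not attempt.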
Leiden University Libraries' own monitoring
Within Leiden University, there is a distinction between the central-level reporting, which is performed for the annual university reports and the national reporting, and the reporting produced by the Leiden University Libraries and faculties in order to monitor and better steer various OA projects. The differences lie in the actual sub-set of CRIS data that is used, the point in time at which it is extracted, and the external sources with which the internal data has been enriched.
At Leiden University Libraries we monitor each institute’s progress towards the 100% green OA policy, so as to help steer the direction of activities at the level of faculties and/or institutes. That means that we measure frequently and in a more granular way in order to review and optimize implementation of the policy.
To make the regular collection of data feasible, we helped Utrecht University further develop an R script they had written for their own use, so that it is more generic and flexible in terms of data sources and reporting options.
Two advantages of using this method to extract OA data from the CRIS are that faculties and institutes themselves (if they want to) can easily run the script for analysis, and that we are then all using the same method in a transparent way.
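To give an impression of the granular, per-faculty monitoring described in this section, here is a small sketch in Python. It is not the R script itself and does not reproduce its logic; the rows, the faculty labels and the function name `oa_share_by_unit` are hypothetical.

```python
from collections import defaultdict

# Hypothetical CRIS export rows: (faculty or institute, is the publication OA?)
ROWS = [
    ("Humanities", True),
    ("Humanities", False),
    ("Law", True),
    ("Science", True),
    ("Science", True),
    ("Science", False),
    ("Science", False),
]

def oa_share_by_unit(rows):
    """Return the OA percentage per faculty or institute, rounded to one decimal."""
    counts = defaultdict(lambda: [0, 0])  # unit -> [open, total]
    for unit, is_oa in rows:
        counts[unit][0] += int(is_oa)
        counts[unit][1] += 1
    return {
        unit: round(100.0 * n_open / total, 1)
        for unit, (n_open, total) in counts.items()
    }

for unit, share in sorted(oa_share_by_unit(ROWS).items()):
    print(f"{unit}: {share}% OA")
```

Running such a breakdown at regular intervals on a fresh export is what makes it possible to follow progress per faculty or institute rather than only for the university as a whole.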
Transparent and reproducible monitoring
Recently the R script that can be used to monitor OA publications from the local CRIS or other sources was published on GitHub as an R package with an open source license.
By using this R package, the definitions and methods for OA monitoring become more uniform and transparent.
On 3rd – 4th June 2020, an online "bring your own data" workshop for support staff from all the universities in the Netherlands was organised by Utrecht University and Leiden University Libraries. Seven universities joined, and within two sessions of two hours they had each successfully completed the analysis and the visualisation of their own CRIS data. That shows how efficiently the monitoring can be done.
Because the R package is so efficient, frequent measuring can be carried out more easily. The R package also makes it possible to monitor OA on a faculty and institute level. More frequent monitoring on a granular level can help to steer the optimal implementation of an OA policy.
The advantage of relying on such a method is that the data is owned by the university and the results can easily be reproduced. In order to make our own results reproducible, the Leiden University Libraries' Centre for Digital Scholarship is investigating whether we can share anonymised datasets alongside the method (in this case documented in the R package).
The disadvantage is that the user must be able to run R to reproduce the results, which might be a barrier for people with less technical knowledge.
Conclusion
Most of the time, the methods and data behind the monitoring of OA publishing are not transparent and not reproducible.
The CWTS Leiden Ranking uses uniform definitions and methods, making this international ranking ideal for analysing long-term trends and comparing institutions on a worldwide scale. The disadvantages are that, due to the data source, some disciplines are under-represented, the data is proprietary and cannot be shared, and the results therefore cannot be reproduced.
The universities in the Netherlands, together with the VSNU, have set up a definition framework to monitor OA on a national level. Universities spend a lot of time (a few weeks per year is not unusual) performing the analysis. Although they use the same definition framework, in practice they use different methods and have made many different choices. As a consequence, the results of the monitoring are not comparable across the universities, and as long as the story behind the data collection is not told, this will never be clear to third parties.
Our principles and recommendations
For the reasons described in this blog post, we at the Leiden University Libraries' Centre for Digital Scholarship:
- adhere to the open science principles and sign up to transparent and reproducible monitoring of OA publishing;
- will use the R package for efficient, transparent, reproducible and regular monitoring;
- are convinced that using the R package will have a stimulating effect if institutes, faculties, and the Leiden University Libraries can easily monitor themselves at any time of the year and keep track of progress;
- will do our utmost to share our data to be able to provide reproducible outcomes;
- strongly recommend other universities and research institutions use the R package for monitoring;
- are convinced that if all universities in the Netherlands shared their data, we could complete OA monitoring at a national level in 2 – 4 hours and even deduplicate the publications from the different universities.