How it Works

The COKI Open Access Dataset measures open access performance for 227 countries and 61,383 institutions. The dataset includes countries with at least 5 research outputs, institutions that we covered when our dataset was based on Microsoft Academic Graph, and all other institutions with at least 50 research outputs. It is created with the COKI Academic Observatory data collection pipeline, which fetches data about research publications from multiple sources, synthesises the datasets and creates the open access calculations for each country and institution. The data is then visualised on this website. The code for this website is available at the COKI Open Access Website GitHub project.

1. Fetch Datasets

Each week we collect a number of specialised research publication datasets. These include Crossref Metadata, Crossref Funder Registry, Crossref Events, OpenAlex, Unpaywall, the Research Organization Registry (ROR) and Open Citations. A subset of these datasets is used to produce the data for this website and the COKI Open Access Dataset: Crossref Metadata, OpenAlex, Unpaywall and the ROR. The table below illustrates what each dataset is used for.

Dataset	Role
Crossref Metadata	Citations, Paper Title, Journal Name
Crossref Funder Registry	Funder
Crossref Events	Social Media and Internet Events
OpenAlex	Affiliation, Subject
Unpaywall	Open Access Status
Research Organization Registry	Institution Identifiers
Open Citations	Additional citation information

Table 1. Datasets and their roles.

2. Synthesis

After the datasets are fetched, they are synthesised to produce aggregate time series statistics for each country and institution (entity type) in the dataset. These statistics include publication count, open access status, citation count and alt-metrics.

The synthesis occurs in four steps (Figure 1):

Creating a table of publications.
Filtering publications based on their Crossref Metadata type.
Grouping the publications by entity type and year of publication.
Computing aggregated summaries for each group.

Each step of the process is explained below with examples.

Figure 1. COKI dataset analysis pipeline.

The table of publications is created by joining records from the research publication datasets on Digital Object Identifiers (DOIs), which are unique digital identifiers given to the majority of publications. Figure 2 illustrates how each dataset contributes to the publications table during the joining process, using the example of a single publication. Unique publications are discovered with Crossref Metadata, from which the publication’s DOI, Journal, Publisher, Funder identifiers and citation counts are derived. The publication’s Open Access status is computed using Unpaywall. The authors of the paper and their institutional affiliations are derived from OpenAlex. ROR is used to enrich the institutional affiliation records with institution details and to map institutions to countries and regions. The COKI Open Access Dataset uses the ROR assignment of country codes to institutions.

Figure 2. How each dataset contributes to the publications table.
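The join on DOIs can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the record layout (plain dictionaries with a "doi" key) and the DOI normalisation step are all assumptions made for the example.

```python
DOI_URL_PREFIX = "https://doi.org/"


def normalise_doi(doi):
    """Lower-case a DOI and drop a leading resolver URL (assumed normalisation)."""
    doi = doi.strip().lower()
    if doi.startswith(DOI_URL_PREFIX):
        doi = doi[len(DOI_URL_PREFIX):]
    return doi


def build_publications(crossref, unpaywall, openalex):
    """Left-join Unpaywall and OpenAlex records onto Crossref Metadata by DOI.

    Crossref Metadata defines the set of publications; a publication with no
    match in another dataset gets None for that dataset.
    """
    unpaywall_by_doi = {normalise_doi(r["doi"]): r for r in unpaywall}
    openalex_by_doi = {normalise_doi(r["doi"]): r for r in openalex}

    publications = []
    for record in crossref:
        doi = normalise_doi(record["doi"])
        publications.append(
            {
                "doi": doi,
                "crossref": record,
                "unpaywall": unpaywall_by_doi.get(doi),
                "openalex": openalex_by_doi.get(doi),
            }
        )
    return publications
```

Starting from Crossref Metadata mirrors the statement above that unique publications are discovered with that dataset; the other sources only enrich rows that already exist.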

After creating the publications table, we filter the publications based on their Crossref Metadata type. The types we include in this process are journal articles, proceedings articles, reports, posted content, edited books, books, book chapters, book parts, book sections, reference books, monographs, reference entries, and other. However, we exclude types such as datasets, databases, components, report components, peer reviews, grants, proceedings, journal issues, report series, book tracks, and any with a null type.
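As a rough sketch, the type filter amounts to an allow-list check. The hyphenated identifiers below are assumed to be the Crossref type ids corresponding to the types named above, and the function name and the "type" key are hypothetical.

```python
# Crossref Metadata types kept in the analysis (assumed Crossref type ids).
INCLUDED_TYPES = frozenset(
    {
        "journal-article",
        "proceedings-article",
        "report",
        "posted-content",
        "edited-book",
        "book",
        "book-chapter",
        "book-part",
        "book-section",
        "reference-book",
        "monograph",
        "reference-entry",
        "other",
    }
)


def filter_publications(publications):
    """Keep publications whose type is on the allow-list.

    Excluded types (e.g. "dataset", "peer-review") and records with a
    missing or null type are dropped, because None is never in the set.
    """
    return [p for p in publications if p.get("type") in INCLUDED_TYPES]
```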

Once the publications table has been filtered, the publications are grouped by entity type and publication year. For instance, as shown in Figure 3 below, publications are grouped by institution and publication year. The last step creates aggregate time series statistics from the yearly groups of publications.

Figure 3. How the publications table is created.
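The grouping and aggregation steps can be illustrated with a small sketch. The record keys ("institutions", "year", "is_oa", "citations") and the output field names are hypothetical, and the sketch assumes that a publication with several affiliated institutions counts once towards each of them.

```python
from collections import defaultdict


def aggregate_by_institution_year(publications):
    """Group publications by (institution, year) and sum simple statistics."""
    groups = defaultdict(lambda: {"n_outputs": 0, "n_oa": 0, "n_citations": 0})
    for pub in publications:
        # set() guards against the same institution being listed twice
        # on one publication (for example, two authors from one institution).
        for institution in set(pub["institutions"]):
            stats = groups[(institution, pub["year"])]
            stats["n_outputs"] += 1
            stats["n_oa"] += int(pub["is_oa"])
            stats["n_citations"] += pub["citations"]
    return dict(groups)
```

The same function would serve for countries by swapping the institution identifiers for country codes; the resulting per-year entries form the time series for each entity.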

3. Open Access Calculations

The Unpaywall dataset is used to calculate Open Access status. The calculations for Publisher Open, Other Platform Open and Closed Access are described in Table 2 below.

Category	Description	Unpaywall Query Details
Publisher Open	An article published in an Open Access Journal or made accessible in a Subscription Journal.	Where the Unpaywall journal_is_in_doaj field is True or where the Unpaywall best_oa_location location_type field is “publisher”.
Other Platform Open	The publication was shared online; on a preprint server, a university library repository, domain repository or an academic staff page.	Any article where any oa_location element in the Unpaywall data has the location_type “repository”.
Closed	A publication that is neither Publisher Open nor Other Platform Open.	Where journal_is_in_doaj is False and best_oa_location is null.

Table 2. Open Access status calculations.
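The rules in Table 2 can be expressed as a short function. This is a sketch only: the field names are taken from the table as written, the nested-dictionary record layout is an assumption, and the function name is hypothetical. Note that Publisher Open and Other Platform Open are not mutually exclusive, so the sketch returns one flag per category.

```python
def oa_status(record):
    """Return Publisher Open, Other Platform Open and Closed flags for one record."""
    in_doaj = bool(record.get("journal_is_in_doaj"))
    best = record.get("best_oa_location")  # None when no open copy is known
    locations = record.get("oa_locations") or []

    publisher_open = in_doaj or (
        best is not None and best.get("location_type") == "publisher"
    )
    other_platform_open = any(
        loc.get("location_type") == "repository" for loc in locations
    )
    # Closed follows the query in Table 2: not in DOAJ and no best location.
    closed = not in_doaj and best is None

    return {
        "publisher_open": publisher_open,
        "other_platform_open": other_platform_open,
        "closed": closed,
    }
```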

The calculations for the Publisher Open categories are described in Table 3 below.

Category	Description	Unpaywall Query Details
OA Journal	Published in an Open Access Journal.	We use the journal_is_in_doaj field from Unpaywall to define this category, which requires that some licensing information be provided.
Hybrid	Made accessible in a Subscription Journal with an open license.	We check that the license field for the best_oa_location is not null and journal_is_in_doaj is False. This includes the value of “implied_oa” which covers cases where publishers have a general assertion of a license but it is not clear from the page.
No Guarantees	Made accessible in a Subscription Publisher with no reuse rights.	All cases where the best_oa_location location_type field is “publisher”, the license field is null, and journal_is_in_doaj is False.

Table 3. Publisher Open category calculations.
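A sketch of the Table 3 rules is given below. Field names follow the table as written; the function name and record layout are hypothetical. The sketch also assumes that Hybrid, as a Publisher Open sub-category, requires the best location to be hosted by the publisher, which the table implies but does not state.

```python
def publisher_open_category(record):
    """Return "OA Journal", "Hybrid", "No Guarantees", or None if not Publisher Open."""
    if record.get("journal_is_in_doaj"):
        return "OA Journal"

    best = record.get("best_oa_location")
    if best is None or best.get("location_type") != "publisher":
        return None  # not Publisher Open at all

    # Any non-null license counts, including the "implied_oa" value.
    if best.get("license") is not None:
        return "Hybrid"
    return "No Guarantees"
```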

The calculations for the Other Platform Open categories are described in Table 4 below.

Category	Description	Query Details
Institution	Publications placed in institutional repositories, which are archives for storing and distributing an institution's research outputs. Includes repositories shared amongst multiple institutions.	Where we have manually matched a repository to an institution, or where oa_locations.repository_institution matches a ROR id with the ROR affiliation matcher, or where the domain from the pmh_id field matches a link from a ROR record.
Preprint	Publications deposited on servers that do not make claims about formal peer review. Generally non-peer reviewed manuscripts, including working papers on platforms such as arXiv, bioRxiv, SSRN, RePec etc.	Where we have manually classified a repository as a preprint server.
Domain	Publications from domain repositories, also known as disciplinary or subject repositories. A domain repository contains publications from a specific subject area. Examples include PubMed Central, Europe PMC and Econstor.	Where we have manually classified a repository as a domain repository.
Public	Publications from repositories that can be used by researchers from any domain and to deposit any form of output, including pre-prints, published manuscripts and datasets. Semantic Scholar, Figshare and Zenodo are a few examples.	Where we have manually classified a repository as a public repository.
Other Internet	Outputs on sites we have not classified. In practice these are copies identified by CiteSeer X which is in turn indexed by Unpaywall. It may include publications on academic staff pages, blogs and social networks. We do not directly track outputs on platforms such as academia.edu and researchgate.net.	Outputs found on CiteSeer X, which often point to academic staff pages and blogs. We do not currently track outputs from academic social networks. Also includes outputs from repositories that we have not yet classified.

Table 4. Other Platform Open category calculations.
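The Table 4 categories depend largely on manual classification, which can be pictured as a lookup with a fallback. The sketch below is illustrative only: the lookup holds just the example repositories named in the table, keying it by repository name is an assumption, the function and parameter names are hypothetical, and checking for an institution match before the manual lookup is an assumed order of precedence.

```python
# Illustrative manual classification, limited to the examples named in Table 4.
MANUAL_CLASSIFICATION = {
    "arXiv": "Preprint",
    "bioRxiv": "Preprint",
    "SSRN": "Preprint",
    "PubMed Central": "Domain",
    "Europe PMC": "Domain",
    "Econstor": "Domain",
    "Semantic Scholar": "Public",
    "Figshare": "Public",
    "Zenodo": "Public",
}


def repository_category(repository_name, matched_ror_id=None):
    """Classify a repository copy into an Other Platform Open category.

    matched_ror_id stands in for the result of matching the repository to an
    institution (manually, via the ROR affiliation matcher, or via the
    pmh_id domain); None means no institution was matched.
    """
    if matched_ror_id is not None:
        return "Institution"
    # Repositories that have not been classified fall through to Other Internet.
    return MANUAL_CLASSIFICATION.get(repository_name, "Other Internet")
```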

To see the SQL script that calculates Open Access status, follow this link.

4. Limitations

The limitations of our methodology include:

Research outputs that do not have an associated DOI are not included in this analysis. Although the analysis still covers over 100 million outputs, a substantial part of the scholarly record is not currently covered by this identifier system.
Funder data only exists from the commencement of the Crossref Fundref initiative and is not complete, with quality diminishing the further back in time you go.
Affiliation data used to link outputs to institutions (and then to countries) has limitations and biases. This is true of any dataset, and we welcome collaboration with anyone interested in contributing to the improvement of this data over time.