The COKI Open Access Dataset measures open access performance for 225 countries and 52,441 institutions. The dataset includes countries with at least 5 research outputs, institutions that we covered when our dataset was based on Microsoft Academic Graph, and all other institutions with at least 50 research outputs. It is created with the COKI Academic Observatory data collection pipeline, which fetches data about research publications from multiple sources, synthesises the datasets and calculates the open access statistics for each country and institution. The data is then visualised on this website. The code for this website is available at the COKI Open Access Website GitHub project.
Each week we collect a number of specialised research publication datasets. These include Crossref Metadata, Crossref Funder Registry, Crossref Events, OpenAlex, Unpaywall, the Research Organization Registry (ROR) and Open Citations. A subset of these datasets is used to produce the data for this website and the COKI Open Access Dataset: Crossref Metadata, OpenAlex, Unpaywall and ROR. The table below shows what each dataset is used for.
Dataset | Role |
---|---|
Crossref Metadata | Citations, Paper Title, Journal Name |
Crossref Funder Registry | Funder |
Crossref Events | Social Media and Internet Events |
OpenAlex | Affiliation, Subject |
Unpaywall | Open Access Status |
Research Organization Registry | Institution Identifiers |
Open Citations | Additional citation information |
Table 1. Datasets and their roles.
After the datasets are fetched, they are synthesised to produce aggregate time series statistics for each country and institution (entity type) in the dataset. The aggregate time series statistics include publication count, open access status, citation count and alt-metrics.
The synthesis occurs in three steps (Figure 1):
Figure 1. COKI dataset analysis pipeline.
The table of publications is created by joining records from the research publication datasets on Digital Object Identifiers (DOIs), which are unique digital identifiers given to the majority of publications. Figure 2 illustrates how each dataset contributes to the publications table during the joining process, using the example of a single publication. Unique publications are discovered with Crossref Metadata, from which the publication’s DOI, Journal, Publisher, Funder identifiers and citation counts are derived. The publication’s Open Access status is computed using Unpaywall. The authors of the paper and their institutional affiliations are derived from OpenAlex. ROR is used to enrich the institutional affiliation records with institution details and to map institutions to countries and regions. The COKI Open Access Dataset uses the ROR assignment of country codes to institutions.
Figure 2. How each dataset contributes to the publications table.
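The joining process described above can be sketched in Python. This is a minimal illustration only, not the pipeline's actual code: the function names, the record layout (plain dictionaries with a `doi` key) and the lower-casing of DOIs before matching are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def normalise_doi(doi: str) -> str:
    # Assumption: DOIs are matched case-insensitively after trimming whitespace.
    return doi.strip().lower()


def index_by_doi(records: Iterable[Record]) -> Dict[str, Record]:
    # Build a DOI -> record lookup for one source dataset.
    return {normalise_doi(r["doi"]): r for r in records}


def build_publications(
    crossref: Iterable[Record],
    unpaywall: Iterable[Record],
    openalex: Iterable[Record],
) -> List[Record]:
    """Left-join Unpaywall and OpenAlex records onto Crossref Metadata by DOI."""
    unpaywall_by_doi = index_by_doi(unpaywall)
    openalex_by_doi = index_by_doi(openalex)
    publications = []
    for record in crossref:
        doi = normalise_doi(record["doi"])
        publications.append(
            {
                "doi": doi,
                "crossref": record,  # title, journal, citations
                "unpaywall": unpaywall_by_doi.get(doi),  # open access status
                "openalex": openalex_by_doi.get(doi),  # affiliations
            }
        )
    return publications
```

Crossref Metadata drives the join in this sketch because the text states that unique publications are discovered there; a DOI missing from the other sources simply yields `None` for that part of the row.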
After creating the publications table, we filter the publications based on their Crossref Metadata type. The types we include in this process are journal articles, proceedings articles, reports, posted content, edited books, books, book chapters, book parts, book sections, reference books, monographs, reference entries, and other. However, we exclude types such as datasets, databases, components, report components, peer reviews, grants, proceedings, journal issues, report series, book tracks, and any with a null type.
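As a rough sketch, the type filter can be expressed as an allow-list. The hyphenated strings below assume that the `type` value follows Crossref's type identifiers (for example `journal-article`); the `filter_publications` function and the record layout are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Types kept in the dataset, as listed in the text above.
INCLUDED_TYPES = frozenset(
    {
        "journal-article",
        "proceedings-article",
        "report",
        "posted-content",
        "edited-book",
        "book",
        "book-chapter",
        "book-part",
        "book-section",
        "reference-book",
        "monograph",
        "reference-entry",
        "other",
    }
)


def filter_publications(publications: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # A missing or null type is never in the allow-list, so it is dropped too.
    return [p for p in publications if p.get("type") in INCLUDED_TYPES]
```

An allow-list is used here rather than a block-list so that any type not named in the text is excluded by default.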
Once the publications table has been filtered, the publications are grouped by entity type and publication year. For instance, as shown in Figure 3 below, publications are grouped by institution and publication year. The last step creates aggregate time series statistics from the yearly groups of publications.
Figure 3. How the publications table is created.
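The grouping and aggregation steps might look like the following sketch. The field names (`year`, `institution_ids`, `is_oa`, `citations`) are hypothetical, and the choice to count a publication once for each distinct affiliated institution is an assumption of this example rather than something stated above.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple

Stats = Dict[str, int]


def aggregate_by_institution_year(
    publications: Iterable[Dict[str, Any]],
) -> Dict[Tuple[str, int], Stats]:
    """Group publications by (institution, year) and sum simple statistics."""
    groups: Dict[Tuple[str, int], Stats] = defaultdict(
        lambda: {"n_outputs": 0, "n_oa": 0, "n_citations": 0}
    )
    for pub in publications:
        # Assumption: de-duplicate affiliations so a paper counts once per institution.
        for ror_id in set(pub["institution_ids"]):
            stats = groups[(ror_id, pub["year"])]
            stats["n_outputs"] += 1
            stats["n_oa"] += 1 if pub["is_oa"] else 0
            stats["n_citations"] += pub["citations"]
    return dict(groups)
```

Grouping by country would follow the same pattern, with the institution identifier replaced by the country code that ROR assigns to each institution.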
The Unpaywall dataset is used to calculate Open Access status. The calculations for Publisher Open, Other Platform Open and Closed Access are described in Table 2 below.
Category | Description | Unpaywall Query Details |
---|---|---|
Publisher Open | An article published in an Open Access Journal or made accessible in a Subscription Journal. | Where the Unpaywall journal_is_in_doaj field is True or where the Unpaywall best_oa_location location_type field is “publisher”. |
Other Platform Open | The publication was shared online, for example on a preprint server, a university library repository, a domain repository or an academic staff page. | Any article where any oa_location element in the Unpaywall data has the location_type “repository”. |
Closed | A publication that is neither Publisher Open nor Other Platform Open. | Where journal_is_in_doaj is False and best_oa_location is null. |
Table 2. Open Access status calculations.
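The rules in Table 2 can be read as three boolean tests on an Unpaywall record. The sketch below is an illustration of that reading and is not the SQL used in production; the field names are copied from the table as written, and the `classify_open_access` function and the dictionary layout are hypothetical.

```python
from typing import Any, Dict


def classify_open_access(record: Dict[str, Any]) -> Dict[str, bool]:
    """Apply the Table 2 rules to one Unpaywall-style record."""
    in_doaj = bool(record.get("journal_is_in_doaj"))
    best = record.get("best_oa_location")
    locations = record.get("oa_locations") or []

    # Publisher Open: journal is in DOAJ, or the best location is the publisher.
    publisher_open = in_doaj or (
        best is not None and best.get("location_type") == "publisher"
    )
    # Other Platform Open: any location is a repository.
    other_platform_open = any(
        loc.get("location_type") == "repository" for loc in locations
    )
    # Closed: not in DOAJ and no best open access location at all.
    closed = not in_doaj and best is None
    return {
        "publisher_open": publisher_open,
        "other_platform_open": other_platform_open,
        "closed": closed,
    }
```

Under this reading, Publisher Open and Other Platform Open are not mutually exclusive: a publication with both a publisher copy and a repository copy satisfies both tests.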
The calculations for the Publisher Open categories are described in Table 3 below.
Category | Description | Unpaywall Query Details |
---|---|---|
OA Journal | Published in an Open Access Journal. | We use the journal_is_in_doaj tag from Unpaywall to define this category, which requires that some licensing information be provided. |
Hybrid | Made accessible in a Subscription Journal with an open license. | We check that the license field for the best_oa_location is not null and journal_is_in_doaj is False. This includes the value of “implied_oa” which covers cases where publishers have a general assertion of a license but it is not clear from the page. |
No Guarantees | Made accessible in a Subscription Publisher with no reuse rights. | All cases where the best_oa_location is “publisher”, the license field is null, and journal_is_in_doaj is False. |
Table 3. Publisher Open category calculations.
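A possible reading of Table 3 as a decision sequence is sketched below. The function name and returned labels are hypothetical, and the example assumes that Hybrid, being a Publisher Open sub-category, also requires the best location to be the publisher; the table itself only states the license and DOAJ conditions for that row.

```python
from typing import Any, Dict, Optional


def classify_publisher_open(record: Dict[str, Any]) -> Optional[str]:
    """Return the Publisher Open sub-category, or None if not Publisher Open."""
    if record.get("journal_is_in_doaj"):
        return "oa_journal"
    best = record.get("best_oa_location")
    # Assumption: Hybrid and No Guarantees both need a publisher-hosted best location.
    if best is None or best.get("location_type") != "publisher":
        return None
    if best.get("license") is not None:
        # Includes "implied_oa", since any non-null license value qualifies.
        return "hybrid"
    return "no_guarantees"
```

Checking `journal_is_in_doaj` first keeps the three sub-categories mutually exclusive, which matches the table's requirement that Hybrid and No Guarantees apply only when that field is False.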
The calculations for the Other Platform Open categories are described in Table 4 below.
Category | Description | Query Details |
---|---|---|
Institution | Publications placed in institutional repositories, which are archives for storing and distributing an institution's research outputs. Includes repositories shared amongst multiple institutions. | Where we have manually matched a repository to an institution, or where oa_locations.repository_institution matches a ROR id with the ROR affiliation matcher, or where the domain from the pmh_id field matches a link from a ROR record. |
Preprint | Publications deposited on servers that do not make claims about formal peer review. Generally non-peer-reviewed manuscripts, including working papers on platforms such as arXiv, bioRxiv, SSRN and RePEc. | Where we have manually classified a repository as a preprint server. |
Domain | Publications from domain repositories, also known as disciplinary or subject repositories. A domain repository contains publications from a specific subject area. Examples include PubMed Central, Europe PMC and Econstor. | Where we have manually classified a repository as a domain repository. |
Public | Publications from repositories that can be used by researchers from any domain and to deposit any form of output, including pre-prints, published manuscripts and datasets. Semantic Scholar, Figshare and Zenodo are a few examples. | Where we have manually classified a repository as a public repository. |
Other Internet | Outputs on sites we have not classified. In practice these are copies identified by CiteSeerX, which is in turn indexed by Unpaywall. It may include publications on academic staff pages, blogs and social networks. We do not directly track outputs on platforms such as academia.edu and researchgate.net. | Outputs found on CiteSeerX, which often point to academic staff pages and blogs. We do not currently track outputs from academic social networks. Also includes outputs from repositories that we have not yet classified. |
Table 4. Other Platform Open category calculations.
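Because most of the Table 4 categories rely on manual classification, the logic amounts to a lookup with a fallback. The sketch below illustrates that idea only. The lookup table contains just the example repositories named in Table 4, keyed by lower-cased name (an assumption), and the function name, the `institution_ror_id` parameter and the choice to test the institution match first are all hypothetical. The ROR affiliation matching itself is not shown.

```python
from typing import Optional

# Hypothetical lookup built from the examples named in Table 4.
MANUAL_CLASSIFICATION = {
    "arxiv": "preprint",
    "biorxiv": "preprint",
    "ssrn": "preprint",
    "repec": "preprint",
    "pubmed central": "domain",
    "europe pmc": "domain",
    "econstor": "domain",
    "semantic scholar": "public",
    "figshare": "public",
    "zenodo": "public",
}


def classify_repository(name: str, institution_ror_id: Optional[str] = None) -> str:
    """Return the Other Platform Open sub-category for one repository."""
    # Assumption: a repository already matched to an institution is "institution".
    if institution_ror_id is not None:
        return "institution"
    # Anything not yet classified falls through to Other Internet.
    return MANUAL_CLASSIFICATION.get(name.strip().lower(), "other_internet")
```

The fallback mirrors the last row of the table, where repositories that have not yet been classified are counted as Other Internet.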
To see the SQL script that calculates Open Access status, follow this link.
The limitations of our methodology include: