Background information

New revenue streams for corporate publishers

For some time now, the major academic publishers have been redesigning their business model from the ground up, with considerable effects on science and scholarship at large. The aggregation and re-use or resale of user data is moving to the center of publishing activities [1], and publishers now expressly describe themselves as data analytics companies [2]. Their business model is thus shifting from content provision to data analytics. When scientists use information services, for example for literature research, their data (personal profiles, access and usage data, time spent at information sources, etc.) are tracked, i.e. recorded and stored. Science tracking is carried out using an ensemble of tools that range from tracking site visits via authentication systems to detailed real-time data on the information behavior of individuals and institutions. Page visits, accesses and downloads, among other things, are recorded and yield granular profiles of scientific behavior, often without the users' knowledge or with insufficient information given to them. Data from various sources can be aggregated and combined with further information about the individuals, including information from outside the scientific environment.

[1] Aspesi, C., Allen, N. S., Crow, R., Daugherty, S., Joseph, H., McArthur, J. T., & Shockey, N., SPARC Landscape Analysis (29 March 2019), https://doi.org/10.31229/osf.io/58yhb

[2] See, e.g., “a global leader in information and analytics” https://www.elsevier.com/about/this-is-elsevier

Publishers track researchers for multiple reasons

The reasons for this data collection by publishers are twofold. On the one hand, it opens up a new line of business dealing in data on knowledge, research developments and the actors behind them. On the other hand, it expands the monopoly structure of the large corporate publishers. The goal is to lock researchers into a single-provider system for all of their activities in the research workflow. At the same time, the data collection built into this system allows the provider to offer research information, e.g., to the research institution employing the users of this system, for evaluation purposes. For example, RELX, the parent company of Elsevier, sells the research information system ‘Pure’ to universities around the world with the explicit promise of insights into the entire research cycle [1]. Another example is the contract signed with Elsevier in the Netherlands in 2020, which expressly grants the publisher the right to access user data in return for the transition to the Publish & Read model [2]. The contract is part of the Seamless Access or GetFTR strategy [3], which provides for a self-contained information supply by the major providers and at the same time aims to facilitate the trade in scientific user data [4]. Even universities that terminate their contracts with major publishers often remain dependent on software, e.g. from Elsevier [5]. Elsevier is also a subcontractor of the European Commission, collecting data on Open Science on its behalf (Open Science Monitor) [6]. In this way, comprehensive data collections are created in the hands of a few large corporations.
The many alternative scholarly information resources that have been developed, partly with public-sector support, in response to the obscene price increases by the corporate publishers are deliberately marginalized unless they are acquired and integrated into the information databases and research tools of the major providers (e.g., Mendeley or Pure).

This development also significantly interferes with the anonymity of scientists, which is guaranteed in principle under data protection law, and makes scientific institutions jointly responsible for violations of the right to informational self-determination. It also encourages data misuse and scientific espionage and can lead to discrimination against individual scientists.

[1] https://www.elsevier.com/solutions/pure

[2] https://www.vsnu.nl/en_GB/news-items.html/nieuwsbericht/597-nederlandse-onderzoeksinstellingen-en-elsevier-gaan-s-werelds-eerste-nationale-open-science-samenwerking-aan

[3] https://www.getfulltextresearch.com

[4] Moore, S. A., Individuation through infrastructure. Journal of Documentation 77(1) (28 July 2020), https://doi.org/10.1108/JD-06-2020-0090

[5] See the institutional users of Pure: https://www.elsevier.com/solutions/pure/clients

[6] Open Science Monitor: https://ec.europa.eu/info/sites/info/files/research_and_innovation/knowledge_publications_tools_and_data/documents/open_science_monitor_methodological_note_april_2019.pdf

How are the publishers tracking researchers?

Publishers obtain the user data they collect and store by three main methods:

  1. Microtargeting: data from direct user traces are combined with purchased data, which third parties, especially the large Internet companies, have in turn condensed into precise profiles.
  2. Harvesting of bidstream data (real-time bidding data): location data, IP addresses, device information and much more are collected in the background, transmitted and linked with an identifier in order to reliably identify people without the need to set a cookie.
  3. Trojans, which are offered to libraries in connection with discounts for other services. This additional software, to be installed in the libraries, collects biometric data such as typing speed or type of mouse movement in order to identify individual users despite the use of proxy servers and VPN tunnels [1]. The Scholarly Networks Security Initiative (SNSI) [2], founded by Elsevier and Springer Nature, advertises such practices and, in conjunction with companies such as PSI [3], argues that this allows users of shadow libraries to be identified and legally prosecuted. These Trojans undermine the security of university networks and potentially expose universities to all kinds of attacks.
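
To make the second method more concrete, the following minimal sketch shows how a few passively observable signals could be combined into a stable identifier without setting a cookie. The signal names, the sample values and the choice of SHA-256 are assumptions made for illustration only; the sketch does not describe the actual implementation of any publisher or advertising exchange.

```python
import hashlib

def cookieless_id(signals):
    """Derive a stable pseudonymous identifier from observable signals.

    `signals` is a hypothetical dict of attribute name -> value.
    """
    # Sort the keys so the same signals always produce the same string
    canonical = "|".join(f"{key}={signals[key]}" for key in sorted(signals))
    # Hash the string and keep a short prefix as the identifier
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Illustrative values only (documentation IP ranges, made-up browser name)
visit_a = {
    "ip": "192.0.2.10",
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "Europe/Berlin",
}
visit_b = dict(visit_a)                       # same device, later visit, no cookie
visit_c = {**visit_a, "ip": "198.51.100.7"}   # same device, different network

same_device_linked = cookieless_id(visit_a) == cookieless_id(visit_b)
print(same_device_linked)  # True: both visits map to one identifier
```

In this naive version a changed IP address (visit_c) yields a different identifier. The biometric signals described in the third method are meant to close exactly this gap when users connect through proxy servers or VPN tunnels.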

Different tools serve these tracking methods. The portfolio currently used to track researchers comprises trackers for page visits, audience tools for aggregating different data sources into profiles, fingerprinters that identify even users who try to prevent identification through their browser settings, and tools for real-time auctioning of user data. The tracking tools mostly come from third parties, chiefly the large Internet companies, but also from specialized companies such as the BlueKai big data platform, which belongs to Oracle, the defendant in a GDPR class action lawsuit over the misuse of personal data [4]. Because the data are already linked with the Internet companies' other data aggregators, they can be combined with data from other areas of life and condensed into profiles. The publishers do not disclose how extensive the tracking is, so we can only refer to various tests, which show, for example, that anyone who accesses articles in the journal Nature is tracked by more than 70 different tools. Finally, the tools used are inaccurate and can therefore have all kinds of unwanted and unexpected detrimental consequences for individual researchers.
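
As a rough illustration of how such a test could be approximated, the sketch below counts the distinct third-party domains among the requests a browser makes while loading one article page. It is not the method used in the tests referred to above. The host names are invented, and the "last two labels" rule is a deliberate simplification (a careful test would use the Public Suffix List and also inspect the scripts themselves).

```python
from urllib.parse import urlparse

def base_domain(url):
    """Crude approximation of the registrable domain: last two host labels."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def third_party_domains(page_url, request_urls):
    """Return the set of domains contacted that differ from the page's own."""
    first_party = base_domain(page_url)
    return {base_domain(u) for u in request_urls} - {first_party, ""}

# Hypothetical request log captured while loading one article page
requests_seen = [
    "https://www.journal.example/article/123",
    "https://static.journal.example/app.js",
    "https://tags.tracker-one.example/pixel.gif",
    "https://sync.tracker-two.example/id?uid=abc",
    "https://cdn.tracker-one.example/lib.js",
]

found = third_party_domains("https://www.journal.example/article/123", requests_seen)
print(sorted(found))  # ['tracker-one.example', 'tracker-two.example']
```

A request log of this kind can be exported from the network panel of a browser's developer tools. The number of distinct third-party domains gives only a lower bound on the number of tracking tools, since one domain can serve several of them.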

[1] Gautama Mehta, Proposal to Install Spyware in University Libraries to Protect Copyrights Shocks Academics, Coda (13 November 2020), https://www.codastory.com/authoritarian-tech/spyware-in-libraries

[2] Scholarly Networks Security Initiative, https://www.snsi.info

[3] “PSI is an independent third-party, which enables libraries, publishers and membership societies to work together securely and confidentially towards the common goals of facilitating legitimate access to scholarly content, eliminating subscription abuse, eradicating IP misuse, and combating cybercrime.”, https://www.psiregistry.org

[4] Natasha Lomas, Oracle and Salesforce Hit with GDPR Class Action Lawsuits Over Cookie Tracking Consent. TechCrunch (14 August 2020), https://techcrunch.com/2020/08/14/oracle-and-salesforce-hit-with-gdpr-class-action-lawsuits-over-cookie-tracking-consent

Publishers already dominate the scholarly infrastructure

Infrastructure controlled by publishers

Further reading / links