Self-Archiving and Journal Subscriptions: Flawed Method and No Data Stevan Harnad 13 Nov 2006 00:31 UTC

        ** Apologies for Cross-Posting **

    Self-Archiving and Journal Subscriptions:
    Critique of Publishing Research Consortium Study

    by Stevan Harnad

The following is a critique of:

    Chris Beckett and Simon Inger, Self-Archiving and Journal
    Subscriptions: Co-existence or Competition? An International Survey
    of Librarians' Preferences. Commissioned by the Publishing Research
    Consortium from Scholarly Information Strategies Ltd (SIS),
    a scholarly publishing consultancy. October 2006.

    SUMMARY OF CRITIQUE: There is no evidence to date that Open Access (OA)
    self-archiving causes journal cancellations. The Publishing
    Research Consortium commissioned a study of acquisitions librarian
    preferences to see whether they could predict such cancellations
    in the future using a "Share of Preference model," but the study
    has a glaring methodological flaw that invalidates its conclusion
    (that self-archiving will cause cancellations). The study consisted
    of asking librarians which of three hypothetical products -- A, B
    or C -- they preferred least and most, for a variety of hypothetical
    combinations of 6 properties with 3-4 possible values each:
        1. ACCESS DELAY: 24-months, 12-months, 6-months, immediate access
        2. PERCENTAGE OF JOURNAL'S CONTENT: 100%, 80%, 60%, 40%
        3. COST: 100%, 50%, 25%, 0%
        4. VERSION: preprint, refereed, refereed+copy-edited, published-PDF;
        5. ACCESS RELIABILITY: high, medium, low
        6. JOURNAL QUALITY: high, medium, low
    No mention was made of OA self-archiving (in order to avoid "bias");
    but, as a result, the survey cannot make any prediction at all
    about the effects of self-archiving on cancellations. The questions
    were about relative preferences for *acquisition* among competing
    "products" having different combinations of properties, and it treated
    OA (0% cost) as if it were just one of those product properties. But
    self-archived articles are not products purchased by acquisitions
    librarians: they are papers given away by researchers, anarchically,
    and in parallel. Hence from the survey's "Share of Preference model"
    it is impossible to draw any conclusions about self-archiving
    causing cancellations by librarians, because the librarians were
    never asked what they would cancel, under what conditions; just what
    hypothetical products they would prefer over what. And of course they
    would prefer lower-priced, immediate products over higher-priced,
    delayed products! But if all articles in all journals were
    self-archived, the "Share of Preference model" does not give us
    the slightest clue about what journals librarians would acquire or
    cancel. Nor does it give us a clue as to what they would do between
    now (c. 15% self-archiving) and then (100% self-archiving). The
    banal fact that everyone would rather have something for free than
    pay for it certainly does not answer this question, or fill
    the gaping evidential gap about the existence, size, or timing of any
    hypothetical effect of self-archiving on cancellations. Nor does
    the study's one nontrivial finding: that librarians don't much
    care about the difference between a refereed author's draft and
    a published-PDF. (Let us hope that this study will be the last futile
    attempt to treat research as if it were done in order to generate or
    protect journal revenues. Even if valid evidence should eventually
    emerge that OA self-archiving does cause journal cancellations,
    it would be for the publishing community to adapt to that new
    reality, not for the research community to abstain from it, and
    its obvious benefits to research, researchers, their institutions,
    their funders, and the tax-paying public that funds the funders and
    for whose benefit the research is conducted.)

Because there has so far been no detectable correlation between author
self-archiving and journal cancellations, the Publishing Research
Consortium commissioned a survey of acquisition librarians' preferences
and attitudes about a number of hypothetical alternatives. From the
responses a theoretical model was constructed, which predicted
cancellations as more self-archived content becomes available. How did
the study arrive at this prediction without any actual cancellation data?

The prediction was based on a rather simple methodological flaw:
Librarians were given a series of hypothetical choices, each a choice
among three hypothetical "products," A, B and C. The librarians were
asked to pick which of the three product options they would prefer most
and least. Each hypothetical product option consisted of a complicated
combination of six properties out of 3-4 possible values per property.

Presenting this array of hypothetical product options as choices to
acquisition librarians (apart from being highly complicated and highly
hypothetical, with many hidden assumptions) is specious, for among the
potential properties of the hypothetical "product" options was the
property that some of the options were free.

But a free self-archived journal article is not a product: It is not
something that an acquisitions librarian decides whether or not to
acquire. Open Access (OA) is not a product-*acquisition* issue at all:
At best (or worst) it's a product *cancellation* issue.

Hence the only credible and direct hypothetical question one could
have asked librarians about self-archived journal articles (and even
then there would be no guarantee that librarians would actually do as
they predicted they would do under the hypothetical conditions) would
be about the circumstances under which they think they would *cancel*
existing journals:

    "Would you cancel journal X if 100% of its articles were accessible
    free online (80%? 60%? 40%?)? If they were accessible immediately
    (after 6 months? 12? 24?)?"

And even that question is laden with highly speculative and even
indeterminate assumptions: How could librarians (or anyone) *know* what
percentage of a journal was accessible for free, self-archived, for any
particular journal?

And what about interactions between journal X and journal Y? (How to
spend a given acquisitions budget -- what to acquire and what to cancel
-- is presumably a comparative decision, and we are asking about the
keep/cancel trade-offs.)

But what if 60% of *all* journals were free online (immediately? after
12 months?)?  (Acquisition/cancellation decisions today are largely
competitive ones: X gets cancelled in favour of Y. The rules of this
trade-off game would presumably change if all journals were roughly on
a par for their percentage of freely available online content or the
length of the delay before it is freely available.)

Straightforward questions on what a librarian predicts they would cancel
(in favour of what) under what hypothetical conditions (and how those
conditions could be ascertained) might possibly have some weak
predictive value. But such straightforward questions are not what
this series of questions about preferences among hypothetical "product
options" asked.

[Even straightforward hypothetical answers to straightforward hypothetical
questions may not have any predictive value if the hypotheses are
far-fetched or unfamiliar enough, if they have hidden or incoherent
assumptions: I frankly don't believe there is a librarian alive who has
a clue as to what they would keep or cancel if the self-archived versions
of all journal articles were suddenly available free online today -- let
alone what they would do as all journal contents gradually approached
100% availability, at various (uncertain) speeds, from a trajectory
of increasing (but uncertain) free content (40% to 60% to 80%) and/or
decreasing delay (24 months to 12 months to 6 months).]

And that's without mentioning intangibles such as any continuing demand
for the paper edition, or how librarians could know the
percentages available, how quickly the percentages would grow, and at
what relative rate they would grow among more and less important
journals, more and less expensive journals.

But it was not even these straightforward, if highly speculative,
questions that were asked of librarians in this survey. Instead, they were
asked to pick the most and least favoured option among three hypothetical
"products," A, B and C, with a variety of complicated combinations of
6 hypothetical properties, which could each take 3-4 values:

    1. ACCESS DELAY: 24-months, 12-months, 6-months, immediate access
    2. PERCENTAGE OF JOURNAL'S CONTENT: 100%, 80%, 60%, 40%
    3. COST: 100%, 50%, 25%, 0%
    4. VERSION: preprint, refereed, refereed+copy-edited, published-PDF;
    5. ACCESS RELIABILITY: high, medium, low
    6. JOURNAL QUALITY: high, medium, low

In each case, products A, B and C were given some combination of the
values on properties 1-6, and the librarian had to choose which of the 3
combinations they most and least preferred.
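For concreteness, the kind of conjoint choice task and "Share of Preference" simulation described above can be sketched roughly as follows. The six attributes and their levels are taken from the study's design, but the part-worth utilities and the logit share rule are purely illustrative assumptions of mine, not the study's actual model or estimates:

```python
# Minimal sketch of a conjoint choice task and a logit "share of
# preference" calculation. Attribute levels follow the study's design;
# the part-worth utilities below are ILLUSTRATIVE ASSUMPTIONS only.
import math

# Assumed part-worth utilities for each attribute level (0.0 = best level).
PART_WORTHS = {
    "delay":       {"24mo": -2.0, "12mo": -1.0, "6mo": -0.4, "immediate": 0.0},
    "content_pct": {40: -1.5, 60: -0.8, 80: -0.3, 100: 0.0},
    "cost_pct":    {100: -2.0, 50: -1.0, 25: -0.5, 0: 0.0},
    "version":     {"preprint": -1.0, "refereed": -0.1,
                    "refereed+copyedited": -0.05, "published-PDF": 0.0},
    "reliability": {"low": -1.2, "medium": -0.5, "high": 0.0},
    "quality":     {"low": -1.5, "medium": -0.6, "high": 0.0},
}

def utility(product):
    """Sum the part-worth utilities of a product's attribute levels."""
    return sum(PART_WORTHS[attr][level] for attr, level in product.items())

def share_of_preference(products):
    """Logit rule: each product's share is exp(U) normalised over options."""
    exps = [math.exp(utility(p)) for p in products]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical "products" A, B, C, as in the survey's choice sets.
A = {"delay": "immediate", "content_pct": 100, "cost_pct": 100,
     "version": "published-PDF", "reliability": "high", "quality": "high"}
B = {"delay": "12mo", "content_pct": 80, "cost_pct": 0,
     "version": "refereed", "reliability": "medium", "quality": "high"}
C = {"delay": "24mo", "content_pct": 40, "cost_pct": 25,
     "version": "preprint", "reliability": "low", "quality": "medium"}

shares = share_of_preference([A, B, C])
```

Note what such a simulator does and does not contain: it ranks hypothetical products against one another, but has no variable anywhere representing a *cancellation* decision, which is precisely the gap at issue.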

From samples of these combinations (interpolated and extrapolated within
and between librarians) the survey concludes that:

> A major study of librarian purchasing preferences has shown that
> librarians will show a strong inclination towards the acquisition [sic]
> of Open Access (OA) materials as they discover that more and more
> learned material has become available in institutional repositories.

(1) OA materials are not "acquired" (and it is both misleading and
absurd to cast either the questions or the responses in an acquisitions
context). Non-OA products are acquired, and the availability of OA
versions of them might or might not induce *cancellation* in favour of
other non-OA products under various circumstances (that are not even
touched upon by this study or its methodology).

Why would the model assume arbitrary differential rates of OA growth
among journals rather than roughly uniform growth across all journals in
each field (apart from random fluctuations)? And if there were systematic
differential OA growth within a field, wouldn't librarians' decisions
depend very much on the field, and on which journal contents happen to
become OA faster, rather than on any general predictions generated from
this theoretical model?

(2) Nothing whatsoever was determined about what happens as more and
more OA becomes available all round, nor about how availability would
be ascertained, nor at what rate OA would grow and be ascertained. There
were merely static questions about 3 hypothetical competing "products,"
some stipulated to be PP% OA within MM months.

> Overall the survey shows that a significant number of librarians
> are likely to substitute OA materials for subscribed resources,
> given certain levels of reliability, peer review and currency of
> the information available. This last factor is a critical one --
> resources become much less favoured if they are embargoed for a
> significant length of time.

The survey shows nothing whatsoever about libraries substituting OA
material for anything, because free self-archived content is not
something a *subscriber* institution (library) provides (by buying it
in) but something an *author* institution provides, via its IR, by
self-archiving it.

If the questions had been forthrightly put as pertaining to cancellation
decisions under various hypothetical conditions, then at least we would
have had librarians' speculations about what they think they would
cancel under those hypothetical conditions. But instead we have
inferences from a model based on least- and most-preferred "product"
options having little or no bearing on any question other than the
librarians' preferences for the hypothetical properties: They prefer
journals with lower prices, whose content is higher quality, more
reliable, more immediate, peer-reviewed, and preferably 100% of it.
(Librarians don't much care whether the peer-reviewed article is the
author's final draft or the publisher's PDF, as long as it's
peer-reviewed: That *is* a genuine finding of this study!)

There is no way at all to interpolate or extrapolate from data like
these to draw valid or even coherent conclusions about self-archiving
and cancellations, with or without a "conjoint analysis" model.

> One of the key benefits of the conjoint analysis approach used in
> this survey was the removal of bias by not referring, when testing
> different product configurations, to any named incarnations of
> content types, including subscription journals, licensed full-text
> (or aggregated) databases, or articles on OA repositories.

This "bias" was eliminated at the cost of making it a questionnaire
about *acquisitions* among a variety of competing "products" when it
should have been a questionnaire about *cancellations* under a variety
of hypothetical OA conditions (many of them unascertainable, hence
indeterminate).

> The survey tested librarians' preferences for a series of hypothetical
> and unnamed products frequently showing unfamiliar combinations of
> attributes -- such as a fully priced journal embargoed for 24 months,
> or content at 25% of the price but through an unreliable service. By
> taking this approach, the survey measured librarians' preferences for
> an abstract set of potential products thus avoiding any pre-conceived
> preferences for named products, such as journals, licensed full-
> text (aggregated) databases or content on OA repositories.

Indeed. But OA is not an alternative product for acquisition: it is a
property that might or might not induce cancellation in favor of *other*
products under certain hypothetical (and presumably competitive)
conditions.

> The data were abstracted into a "Share of Preference" model (or
> simulator) which has then been used to model real-life products and
> thus create predictions for librarians' real-life preferences for
> these products. It is therefore possible to go beyond the comparisons,
> in this work, of journals versus OA and to model other preferences,
> such as between OA and licensed full-text databases.

The "Share of Preference model" might be viable when the preference really
concerns competing products for acquisition, with a variety of rival
properties, but it fails completely when applied to free non-products,
not for acquisition at all, but treated as if they were just another
among the rival properties of products competing for acquisition.

We could have said a-priori that librarians (like all consumers) will
prefer a higher quality product over a lower quality product, 100% of a
product over 60% of a product, an immediate product over a delayed
product, a lower-priced product over a higher-priced product. A "Share
of Preference model" could give some rough rank orders for those various
preferences.

It seems natural to add to such a "Share of Preference model" that
*consumers* will prefer a free product over a priced product, except
that we are talking here about acquisitions librarians, who do not
"acquire" free products but merely buy or cancel priced journals. This
study simply does not and cannot indicate under what OA conditions they
will cancel what for what.

The following (mild) conclusions are the only ones that can be drawn:

>   There is a strong preference for content that has undergone
>   peer review.

Yes, and librarians don't much care whether the peer-reviewed content is
the publisher's PDF version or the author's final version -- except that
the publisher's PDF is for sale and the author's final draft is not! Nor
does the model tell us under what conditions, if both versions are
available for a journal X, librarians would cancel the publisher's PDF
(and in favour of what journal Y?). The question is never even raised.
That's the question the study was designed to answer, but the method
could not answer it. The survey might as well have asked the librarians
directly, for X/Y pairs of hypothetical or actual journals -- rather
than A/B/C triplets of hypothetical "products" -- banal questions such as:

    "If 100% of X were immediately available for free online and Y
    was not, and your users needed X and Y equally, and you could not
    afford both, and you currently subscribed to X and not to Y, would
    you cancel X for Y?"

I suspect that it is because -- in the absence of any actual evidence
of self-archiving causing cancellations -- a survey on hypothetical
cancellations of journal X in favour of journal Y (or no journal at all)
under various %OA and months-delay conditions would not have been very
convincing or informative that the survey instead resorted to "Share of
Preference" modelling. But I'm afraid the outcome is even less convincing.

> How soon content is made available is a key determinant of content
> model preference in librarian's acquisition behaviour; delay in
> availability reduces the attractiveness of a product offering.

Yes, immediate access is preferable to delayed access. And, no doubt,
if/when librarians are ever inclined to cancel a journal X because PP%
of its articles are freely available, they are more likely to do so if
that PP% is immediately available than if it is only available 24 months
after publication. But we could have guessed that without this study.

The question is: Under what circumstances are librarians going to cancel
what, when?  This study does not and cannot tell us. Relative preference
models can only tell us that they are more likely to do it under these
conditions than under those conditions (and we already knew all that).

Having said all this, it is important to state clearly that, although
there is still no evidence at all of self-archiving causing
cancellations, it is possible, indeed probable, that self-archiving will
cause some cancellations, eventually. No one knows (1) how soon it will
cause cancellations, nor (2) how many cancellations it will cause. That
all depends on (a) how much demand there still is for the print edition
and (b) for the journal's online edition at that time, (c) for how long
that demand lasts, and (d) how quickly self-archiving grows and
approaches 100%. (Perhaps someone should do a survey on people's
predictions about those factors!)

But regardless of any of this -- and regardless also of the validity or
invalidity of the present survey -- the possibility or probability of
cancellation pressure is most definitely *not* the basis on which the
research community should decide whether or not to self-archive and
whether or not to mandate self-archiving. That decision must be based
entirely on the benefits of OA self-archiving for research access,
impact, productivity and progress -- definitely not on the basis of the
possibility of revenue losses for publishers.

We do well to remind ourselves that these questions are not primarily
about what is or is not good for the publishing industry. They are about
what is and is not good for research, researchers, their institutions,
their funders, and the tax-paying public that funds the funders.
Research is supported and conducted and peer-reviewed and published for
the sake of research progress and applications, not in order to support
the publishing industry, or to protect it from risk.

And what is certain is that peer-reviewed research publishing can and
will successfully adapt to Open Access: How can it fail to do so, when
it is researchers who conduct the research, write the articles, perform
the peer review, read, use, apply and cite the research, and, now,
provide online access to it as well? Publishers are performing a
valuable service (in implementing the peer review and in providing a
paper and online edition) but it is publishing that must adapt to what
is best for research in the online age, definitely not research that
must adapt to what is best for publishing. And publishing can and will
adapt:

    Berners-Lee, T., De Roure, D., Harnad, S. and Shadbolt, N. (2005)
    Journal publishing and author self-archiving: Peaceful Co-Existence
    and Fruitful Collaboration.

I might add that Dr. Alma Swan is not the super-ennuated (sic) Proustian
personage repeatedly cited in this PRC survey, but the cygnine author
of a number of landmark surveys, one of them reporting the only existing
evidence -- negative -- for a causal connection between OA self-archiving
and cancellations.

    Swan, A. (2005) Open access self-archiving: An Introduction.
    JISC Technical Report.

Stevan Harnad