Fwd: Re: Role of arXiv

Fwd: Re: Role of arXiv Stevan Harnad 10 Oct 2010 22:28 UTC
**Cross-Posted**

Begin forwarded message:

On 08/10/2010 12:56, "Stevan Harnad" <harnad@ecs.soton.ac.uk> wrote:

> On Fri, 8 Oct 2010, Monica Duke wrote:
>
>>> SH:
>>> Harvesting is cheap. And each university's IR will be a standard part of
>>> its online infrastructure.
>>
>> MD:
>> So far do we have enough (or any) evidence that harvesting is cheap? What
>> sense of cheap did you mean?
>
> A harvester does not have to manage the deposit or host the content, as
> arXiv does. It need only harvest and host the metadata. There are countless
> such OA harvesters sprouting all over (not to mention Google
> Scholar!) -- and that's on the sparse OA content that exists today (c.
> 5-25%). Harvesters will abound once the OA content rises toward 100%,
> thanks to OA self-archiving mandates by universities and funders.
>
> History will confirm that we are simply spinning our wheels as we keep
> banging on about publishing costs, repository costs, harvesting costs --
> while our annual research usage and impact burn, because we have not
> got round to mandating deposit...
>
> Stevan Harnad

From: Hugh Glaser hg  -- ecs.soton.ac.uk
Date: October 10, 2010 6:06:16 PM EDT
To: JISC-REPOSITORIES -- JISCMAIL.AC.UK
Subject: Re: Role of arXiv

Spot on Stevan.

It is the work of a day or two to write a harvester for OAI-PMH from scratch
(I know, I did it), although there are now pretty standard libraries for
it. I know others who have done the same. I also wanted to translate into
RDF, which added some effort.
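
To give a sense of how little is involved, here is a minimal sketch of such a harvester in Python, using only the standard library. It is an illustration, not the code described above: the names `harvest` and `http_fetch` are hypothetical, it assumes the repository serves the `oai_dc` metadata format, and it keeps only identifier and title. It follows `resumptionToken` paging and skips records flagged as deleted; it does no error handling or RDF translation.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 and unqualified Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def http_fetch(url):
    """Default fetcher (hypothetical helper): GET the URL, return bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def harvest(base_url, fetch=http_fetch):
    """Yield (identifier, title) for each live record of one repository."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        for record in root.iter(OAI + "record"):
            header = record.find(OAI + "header")
            # Deleted records carry a header but no metadata; skip them.
            if header is None or header.get("status") == "deleted":
                continue
            identifier = header.findtext(OAI + "identifier")
            title = record.findtext(".//" + DC + "title")
            yield identifier, title
        # An absent or empty resumptionToken marks the last page.
        token = root.findtext(".//" + OAI + "resumptionToken")
        if not token or not token.strip():
            break
        # A resumed request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.strip()}
```

Passing a different `fetch` function makes it possible to test the paging logic against canned XML, or to add rate limiting and retries, without touching the parsing code.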

It is then a case of letting it run and funding the maintenance and service.
We have not bothered much to keep it up to date, but we use the metadata all
the time for our applications, and it is not a significant addition on top
of all the other metadata we handle.

The biggest cost is repository software that does not conform to the
accepted view of OAI-PMH. Hopefully this will improve as more people
harvest.

To be concrete, we harvested over 1000 repositories, automatically finding
the details from the ROAR site, which seems to have resulted in 15G of data,
which we then translated into about 24M triples and 21G of RDF.
20 times that (taking Stevan's lowest estimate of c. 5% OA content today and
scaling it to 100%) would be less than 1 TB, which is not really a lot of
cost - right now I could serve that and probably run the whole system with
harvesting for around $100/year on my ISP.
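
The arithmetic behind that estimate can be checked in a few lines. This is a sketch under stated assumptions: that "G" means gigabytes, that both the raw harvest and the RDF are kept, and that volume scales linearly with OA coverage. The variable names are illustrative.

```python
# Figures quoted in the message above.
harvested_gb = 15          # raw harvested metadata
rdf_gb = 21                # the same metadata translated to RDF
triples = 24_000_000       # RDF triples produced

# Assumption: today's content is c. 5% of the total, so 100% is 20x.
scale = 20

total_gb = scale * (harvested_gb + rdf_gb)
total_triples = scale * triples

print(f"storage at 100% OA: {total_gb} GB")
print(f"triples at 100% OA: {total_triples:,}")
```

Even when both copies are kept, the scaled total stays under 1000 GB, which is consistent with the "less than 1 TB" figure.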

So after the initial costs (a month or two to do a great job?), it is a day
a month plus $100.

The crucial thing here is, as Stevan says, that we are only talking
metadata. The idea of the web is to avoid copying stuff, with attendant
storage costs and synchronisation problems, and so the texts should be left
where they lie.

Best
Hugh Glaser