
Saving on a cloud




Several threads came together for me today in an undoubtedly useless thread, but I'll spin it out anyway. What's a blog for?

The various components are: Dave Winer's musing that there should be a generalized service for people to preserve their own content; thinking back to my work at the California Digital Library, where I hooked up CDL's Preservation Program with folks at Amazon's S3; a blog post on Amazon's Elastic Compute Cloud (EC2) and its impact on something or other that I've completely forgotten; a conversation today with someone from UIUC who was talking about a model for repository (digital object archive) interoperation that I found myself not agreeing with philosophically, based on admittedly inadequate information; an email list proffering that Google may offer file-based storage for Google account users; an excellent review of what made YouTube successful as a service; and a touch of rye whiskey.

So at the end of this windy introduction, onto my evolving set of basic thoughts about preservation, and repositories.

  1. I do not believe that libraries lend sufficient additional value to interpose themselves in the process of digital curation, compared to the individual/scholar who has either created or gathered the content, and is in the best position to evaluate its saving. Although libraries may well initiate very high value preservation acts, they should act in this fashion as privileged individuals, and not monopolize the act of preservation.
  2. Preservation should be initiated and controlled by the content owner or rights-validated gatherer, and content objects should not be gated or judged for exclusion except on the grossest factors (such as legality, rights issues, and so forth). Tagging, however, should be open to anyone within the eligible community. (Why heck, any god-fearing provider is able to take advantage of safe-harbor under DMCA.)
  3. I think end-user facing services are far superior to institutionally-arbitrated ones. Further, consumer-facing services that live at the level of the network, not within the bounds of an institution, are likely to have more traction and more visibility. In other words, I think net-based apps can gain sufficient scale to make consumer-facing or community-facing applications superior to institutionally held ones across a range of critical factors such as performance and utility.
  4. I think that academia is different enough from the general population's needs regarding storage in terms of content description, sharing, rights, intent of purpose, and potential or latent value to warrant their own community-based (higher-education) solution, which may not necessarily be distinct in form to a broader-based application.
  5. I think it is quite sub-optimal (a nicer way of saying that something is stupid) for universities to continue creating their own individual, branded repositories when they don't talk to each other very well, and scaling is generally limited to the capabilities of the institution. Among other things, a continuing bias towards institutional solutions creates gross inequities in support capacity across the diversity of universities, in their size, focus, and aspirations.
  6. I think P2P-based preservation strategies tend to possess significant technical administrative overhead to maintain peer-level coordination and cache consistency, compared to centrally-coordinated distributed solutions, and are probably at best an interim solution.
  7. The availability and costs of network infrastructure, and our understanding of how to enact services across it, have advanced to the point that applications can scale easily without requiring the federation of institutional deployments.

So what I would propose is that a consortium of universities, perhaps led by their libraries, initiate adequate development to enable the deployment of a clustered instance of MIT's DSpace open-source repository system instantiated on Amazon's EC2 infrastructure. This network-based, community-oriented preservation repository would be directly available to all HE end-users with minimal gating and minimal content review. Obviously an adequate higher-education governance model would have to be created, along with a means for adequate remuneration of the effort, which might range from institutional membership fees to tiered service offerings that could be rescinded in case of payment lapse, discouraging free-riders.

Sure, this is probably silly, and it might not work. But it is no more silly than individual instances of DSpace or any other repository software; it solves the problem of repository inter-operation; it takes advantage of network-based positive feedback; it binds universities together in common purpose in a big way; it benefits harvestability and therefore content discovery through its aggregation; and it increases content visibility.

Google could do this themselves by gussying up Google Base and then bundling it as part of Google Apps for Higher Ed.

Maybe we should try it first?

Mar 19, 2007 | Categories: DLF, DigLibs, Preservation | pbrantley

5 comments

Comment from: Dorothea Salo [Visitor] Email · http://cavlec.yarinareth.net/
I like this idea more than you might think I do. I am on record a number of times as saying that individual-school IRs are unlikely to be sustainable, and larger (consortial/statewide/whatever) repositories make more sense. And I'm also perfectly well aware that IRs are a hard sell owing to researcher loyalty to discipline far outweighing loyalty to institution.

That said, while I would tend to support the establishment of a giganto-cluster, because IR coverage in the States is patchy at best, I have some questions about it:

1) Content replication for preservation? Who does it? Who monitors it? Who runs the checksum checker? Who does something about problems that crop up?

2) Rights management? Legal representation? A giganto-cluster is a big fat lawsuit target. Individual IRs are many tiny targets. The last thing we want is to hitch all our wagons to a single repository, only to have it tied up in litigation for years or decades. Think Elsevier or the AAP wouldn't do it? Look at the deals the former is signing, and the threatening letters the latter is sending, and think again.

(And then think what happens to green OA in general. The inevitable lawsuit will be spun as "self-archiving is illegal," and most researchers will believe the spin. You know that as well as I do, I'm sure.)

3) You trust Amazon? I don't. We'd need some hefty quality-of-service guarantees -- which Amazon has shied very far away indeed from making.

4) What happens with content that DSpace is far from ideally designed to hold or disseminate? (The elephant in the closet here is video.) What about folks who would prefer EPrints or Fedora?
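As a rough illustration of the "checksum checker" raised in question 1, here is a minimal Python sketch of a fixity audit. It is not DSpace's own checker: the `audit` function and the manifest layout (file name mapped to a previously recorded SHA-256 digest) are assumptions made purely for this example. It only detects problems; who monitors the results and who repairs the damage remain the open questions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the names of files that are missing or whose digest has changed.

    `manifest` is a hypothetical mapping of relative file name to the
    SHA-256 hex digest recorded when the object was deposited.
    """
    problems = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(name)
    return problems
```

Run on a schedule against each replica, a check like this would at least tell someone when a copy has silently degraded.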

I think Google might have a hard time making such a service stick, honestly. Researchers enlightened enough to post their work to the Web either do it through administrative assistants or on their very own web space. In either case, there's an ownership/control issue at stake. Google would have to overcome that... and I'm not sure they can.

As for repository interoperation... I'm impressed with ORE so far. If it doesn't get decorated out of all sense (which it yet may; witness the horror that is OpenURL), it'll do the job.
03/20/07 @ 07:38
Comment from: Raymond Yee [Visitor] Email · http://blog.dataunbound.com
The idea of deploying DSpace on EC2 is probably easy enough and interesting enough for someone to prototype w/o a big investment. Might this type of experimentation be something that can be carried out under the auspices of a DLF "lab" aimed at exploring promising ideas?

-Raymond
03/20/07 @ 07:39
Comment from: Jerome McDonough [Visitor] Email
I am intrigued by your thinking on points #1 and #2. So, explain to me why you think decisions about how something is described should be handed to a community, but decisions about whether something is preserved should be handed to the individual. If you prefer 'community-facing' applications, shouldn't that extend to preservation? Traditionally, librarians have served as community proxies with respect to decisions on preservation. While we can certainly debate how good an idea that is, it seems very odd to me to say that now that we have networked information systems, we will throw open the doors of description to direct community participation, but not decisions on preservation. Why are you favoring the community in the one case, and the individual in the other?

There are at least two potential problems that may result from creating a large shared repository of the kind you propose and leaving decisions regarding archiving in the hands of individuals. 1. It decouples decisions regarding archiving from any knowledge or concern regarding the resource allocation necessary to sustain the archive. You are setting yourself up for a tragedy of the commons situation if your archive truly appeals to the user community. Don't be too certain of your ability to provide the storage people need, not when there are people like me thinking about archiving moving materials, and more frighteningly, researchers working in areas like fMRI looking for places to store their data. Honestly, however, I don't think this problem is a likely one. Far more likely is problem #2: Experience with institutional repositories to date indicates that people are really, really, really awful about arranging for archiving their material, and that there can be a number of social factors working against individuals' archiving works. Within the academy, self-archiving has confronted a variety of issues around negotiations with publishers and concerns about promotion and tenure, loss of control over data needed for projects, lack of concern regarding longevity of data *not* needed by projects, prohibitions imposed by IRBs, etc. Good policies/design may ameliorate these issues; but to the extent they do, you'll increase your odds of having the 'commons' problem.

On your point #3, I think you have some categories there that aren't mutually exclusive. Unless you mean something very specific by these terms, I don't see why an institutionally-arbitrated collection can't be as 'community-facing' as any other form. I note that your proposed solution involves a consortial arrangement -- in my personal ontology, consortium is a subclass of institution. So, I think you have a false dichotomy here. The question isn't institutionally-arbitrated vs. community-facing. It's appropriate alignment of the institutional arrangements with the community of practice.

Which leads nicely to my final question: Who is your community of practice for the archive you are proposing?

I should probably note that I think your idea is a good one and something I *do* think we should explore. But the first step will *have* to be deciding who exactly this system will serve.

03/20/07 @ 08:43
Comment from: Barrie [Visitor] Email
On no. 1: Is the root of the issue process vs. practice?

Libraries have much value to add in the practice of digital curation, and issues regarding developing (read prescribing cum monopolizing) the rules of engagement for individuals/scholars to participate within a cross-institutional repository environment are likely to surface.
Preservation in reality is initiated by both individuals and institutions, e.g., letters in the attic and documents in presidential libraries, respectively. Hunting and gathering (selection and collection development) techniques are expressions of the values, habits, and beliefs of the collecting agent, and there are reasons to celebrate both individual and institutional tastes. The root of the problem is cultural/social. Many faculty just don't see the reward/value in using IRs; see "Institutional Repositories: Evaluating the Reasons for Non-use of Cornell University's Installation of DSpace," D-Lib Magazine (March/April 2007).

On network-based, commercially-developed UIs vs. institutionally-arbitrated ones:

The former are more savvy concerning users' functional requirements because they focus on practice rather than process, and are obviously more concerned about driving traffic towards their branded products/services in the present rather than for long-term preservation. As another cultural issue, users are likely to go for ease of use, despite issues down the road. For example, think del.icio.us social bookmarking vs. an input form invoking simple DC. Which would you prefer?

On a consortial, generalized (monolithic) service driven by users vs. stovepipe operations driven by field experts:

One principle to recognize in the former scenario is the potential to leverage the Condorcet Jury Theorem: the probability of a group effort "getting it right" increases as the size of the group increases. Cass Sunstein covers this in _Infotopia_. This suggests a likely high degree of success for the plan laid out in the above blog entry, both for users' participation in data curation and for consortial partners playing nicely together and ironing out the wrinkles to make the thing work. Realistically there are already less formal solutions for this sort of problem being discussed among communities and networks of practice. The danger in formalizing a consortium is formalizing a consortium, which may not be as nimble as its commercial competitors in keeping abreast of the changing environment.
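For readers who want to see the Condorcet Jury Theorem at work, here is a small Python sketch. It assumes the textbook setup, which real communities may not satisfy: an odd number of voters who decide independently, each correct with the same probability `p`. The function name `majority_correct` is made up for this example.

```python
from math import comb


def majority_correct(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is right.

    Assumes n is odd (so ties cannot occur) and that every voter is right
    with the same probability p.
    """
    needed = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    # With p above one half the group's accuracy climbs with size;
    # with p below one half it falls instead.
    for n in (1, 3, 11, 101):
        print(n, majority_correct(n, 0.6), majority_correct(n, 0.4))
```

The theorem cuts both ways: the benefit of a larger group only appears when individual judgment is better than a coin flip, and it weakens if participants simply copy one another.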

The other side of the coin is that libraries, with all their inadequacies in maintaining their own collections and in developing and complying with library conventions and standards in order to share, have managed to work together to support teaching and learning while maintaining distinctive collections and services. The resulting heterogeneous landscape is one reason why so much content has survived the test of time. Whether it's a library consortium or Google, there is a risk involved in putting all the eggs in one basket. The diversity and diaspora of collections is not necessarily a bad thing, and although a consortial model may seem like the right economic solution it will have its own species of administrative, financial, and other practical issues of the same family from which came the problems it was created to solve.

In the end, this doesn't mean the experiment is not worth the effort.
03/28/07 @ 04:42
Comment from: Michael A Keller [Visitor] Email · http://library.stanford.edu
Peter,
Responding to your "set of basic thoughts" by number...
1. Our surveys of Stanford faculty indicate that they would value and would use the Stanford Digital Repository, because it promises to provide reliable and consistent archival services long after they have forgotten to migrate or otherwise transform their own output. They also want to deposit the by-products, working papers, tech reports, and other salient i.p. from their research that almost certainly would not be widely distributed. Also, the presence of librarians or archivists in the chain of decision makers regarding what to deposit has been in the past, is now, and will be in the future important. Sometimes creators of i.p., of archival materials, save more than is necessary to reflect the historical record, and sometimes they throw away material that could be valuable. And the engagement of the creators, the scholars, with those responsible for the long term care, managed care, of digital archives in the making of policies, reviewing of operations, and assessment of the accurate preservation of digital objects is essential.
3. Net-based apps can also give those institutions which could be and maybe ought to be providing digital curation an excuse not to do so. As an analogy, look at the lack of investments in preservation programs for paper-based library and archival materials, even among the largest of research libraries in the U.S. Because others were supposedly running preservation departments or experimenting with various mass de-acidification processes, most university libraries invested very little in preservation. Also, are not net-based apps fragile regarding unintended benefits, such as preservation of digital objects? What happens to sites that go missing from the Internet Archive or Google or Yahoo or ???
5. Sub-optimal or stupid, there is no digital archiving program that one and all regard as the single best way. Indeed, the single best way approach, whether coordinated across institutions or not, may result in a single point of failure. Should we not regard this period of development of digital archives with lots of different experiments under way as one that may result in one or more successful experiments? And in any case, diversity in approaches has in the past created success through variety. Why shouldn't there be diversity of approaches whether in size, in focus, in aspiration, in i.t. architecture, in policies, and so forth? We will all learn from the diversity, especially if each digital archiving program makes its methods, policies, and aspirations clear. All should have 3rd party audits performed on them. And the reports of the audits should be made public. We need more open debate, more peer review, more engagement, not less. Fortunately, the cost of engagement and the cost of scaling go down measurably each year, even if the demands go up.
6. What is the basis for your thoughts about the "significant technical administrative overhead to maintain peer-level coordination and cache consistency" in P2P preservation strategies? And how does that overhead vary from the implied overhead in your comments in 3 & 5? And shouldn't whatever overhead is calculated take into account the varying other overheads in each of the other approaches? I'm thinking of the "normalization" inherent in Portico and PubMed Central, for instance.
7. The very behaviors of our researchers within universities, much less across them, suggest that federation is with us anyway. Do we need another centrally coordinated approach, or is the one you suggest perhaps different enough to try?

If a "networked based, community oriented" digital archive were to operate, some care would have to be taken to ensure that the efforts and treasuries of the few did not end up taking care of one and all. Proposals for new, 3rd party suppliers of digital archive services might lead again to more pleas to librarians and archivists not to evolve their programs, services, and methods in the digital age.
04/02/07 @ 09:25

This is the personal blog of Peter Brantley, and the opinions expressed here are his own and are not reflective of any of his employers in the continuum of history, or the University of California, which provides support for this blog.
