I wrote an article for the journal code4lib “Using a Web Services Architecture with Me, Myself and I” and I keep realizing all of the things it is missing. But that is what a blog is good for, right?
There is something that just feels right about creating three applications all working in concert to do the job of a single application: it feels a little bit messy, but good messy. It is not that the code is sloppy or carelessly composed. And while I wouldn’t necessarily go so far as to use cliches about the whole being greater than the sum of the parts, the messiness is what makes the application lifelike. In other words, it is like a library. It is as if each individual application comprises a different department making a contribution to the entire teaching and research mission of the library.
Part of this line of thinking is influenced by an excellent article my colleague Allan forwarded, “Design in the Age of Biology.” In it, the author discusses what he calls the rise of service design. He characterizes service in the following way:
Robert Lusch [14] wrote about changes in marketing, describing a service-dominant logic in which “value is defined by and co-created with the consumer rather than embedded in output.” The “make-and-sell” strategy of linear value chains gives way to the “sense-and-respond” strategy of self-reinforcing “value cycles.” Lusch described traditional goods-centered dominant logic as focused on “operand resources,” tangible assets with inherent value. He contrasted that logic with emerging service-centered dominant logic focused on “operant resources,” intangible assets, which create value in their use, such as skills, technologies, and knowledge.
In our case looking at the way in which our applications operate, the value is derived from continuing to further develop their service orientation. Their value is initially based on the service providing behavior: they expose data that is reused and repurposed by other applications. But now I am finding that there is a self reinforcing cycle that is beginning to emerge as we discover other ways to put that data to work.
Which is to say, these applications are beginning to take on a life of their own.
My former boss and colleague Andrew Pace recently commented on the nature of the network and how he was rebuffed by a colleague for overlooking the fact that people make up the network and that they are its most significant piece. I would like to respectfully disagree with his post. Andrew used to boast that he is 100% right 50% of the time, and in this case I believe he was right during the initial part of his musings on this topic.
What is the significance of the network in the 21st century? What we understand as the network is a contemporary realization, or maybe the automated reality, of the old adage that the total is greater than the sum of its parts. And quite frankly this realization was made possible by the amazing things that computers are doing with data.
PageRank is arguably the shot heard throughout the Web. With its PageRank algorithm, Google was able to solve a problem that was plaguing relevancy in Internet search results: we’re all a bunch of dirty rotten liars. Back in the early Yahoo/AltaVista days of search engines, people were figuring out ways to game the system by lying through their metadata. In order to have their crappy cover band’s web page show up when a user searched for the Rolling Stones, the cover band simply needed to put ‘rolling stones’ into its metadata.
PageRank came along and solved the problem by saying, ok, we will let the network sort out the relevancy, and if the network can prove that your website is a good one, you will be rewarded in search results rankings. This is the significance of the network. For better or for worse, the network can prove whether or not the data byproduct of the people is in fact worth what those people claim it is worth.
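To make "letting the network sort out the relevancy" a little more concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four page names are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its links.

    `links` maps a page name to the list of pages it links to.
    The damping factor and iteration count are illustrative defaults.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of inbound links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: everyone links to the real band's page,
# and nobody links to the cover band, whatever its metadata claims.
links = {
    "fan_site": ["rolling_stones"],
    "music_blog": ["rolling_stones"],
    "rolling_stones": ["music_blog"],
    "cover_band": ["rolling_stones"],
}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

The cover band can write whatever it likes about itself, but since no other page links to it, it never rises above the baseline score.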
As Ian Ayres points out in his book Super Crunchers, the world is now using data to make better predictions than traditional experts. What is more, the statistical models being used by doctors, corporations, governments and non-profits are able to leverage the network effects of large data sets to verify how well those predictions are performing and improve those predictions instantly as new data becomes available.
I believe that my issue here is all semantics and I may simply be quibbling over something petty. However, I am splitting hairs over this point because this is a troubling area for libraries in my view. If we get caught up in the mushy people narrative over one of the most significant cultural shifts that is occurring right now, we will miss the point and consequently we will miss the opportunities to maintain the cultural relevancy of libraries in the future. The danger, in my opinion, is similar to the paralogism that because I know the structure of a MARC record I understand how it is stored in a modern RDBMS.
It is imperative that we know how Lucene/Solr works so that we can make better resource discovery systems. It is similarly imperative that we understand how to get in the super crunching game. As Andrew and his colleague Lorcan Dempsey have noted on numerous occasions, we need to do much more with our data, because it’s the network effect, stupid.
(For the record, I do not intend to call either Andrew or his colleagues stupid, I am just leveraging a theme that he and I have been riffing on for a couple of years.)
Here is a simple question with profound implications: is library search the same thing as “search” in the way the population at large understands it, which is to say Googling?
The question is very simple and one that I think has been in the back of my mind for quite some time, but I just read an excerpt on statelessness on the Web from RESTful Web Services that provided me with a new way to frame the question. Richardson and Ruby write:
When you ask for a directory of resources about mice or jellyfish, you don’t get the whole directory. You get a single page of the directory: a list of the 10 or so items the search engine considers the best matches for your query. To get more of the directory you must make more HTTP requests. The second and subsequent pages are distinct states of the application, and they need to have their own URIs: something like http://www.google.com/search?q=jellyfish&start=10. As with any addressable resource, you can transmit that state of the application to someone else, cache it, or bookmark it and come back to it later. (emphasis added)
Here the user behavior seems to be: “Hey, Google, show me whatcha got for jellyfish.”
When I go to my library’s catalog and search for the word jellyfish I think my behavior is different because my expectations are different. I am not expecting the top 10 items on the topic. I am instead doing two different things:
- First, determining whether anything exists on the topic at my library
- Second, retrieving and evaluating a list of these items if they do in fact exist
The difference is that of course Google will have information on a topic because Google aggregates everything (or so it goes in the popular consciousness). The library on the other hand should have something on your topic if your topic serves one of the known collection areas of the library. Understanding the stateless nature of the Web seems to bring this out. The following URIs do not reveal the same state:
- http://www.google.com/search?q=jellyfish: what are the ten best resources about jellyfish according to Google
- madcat.library.wisc.edu…Search_Arg=jellyfish…: how many, if any, resources about jellyfish does my library have
In designing the interfaces for a library catalog front-end, it would be important to be mindful of this distinction since you are answering two very different questions.
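A small sketch may help show what it means for each page of results to be its own addressable state. The parameter names q and start come from the URI Richardson and Ruby quote; the two function names are hypothetical, and nothing here reflects how Google or any catalog actually builds its URIs.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def results_page_uri(base, query, start=0):
    """Build the URI for one page of results; the URI alone carries the state."""
    params = {"q": query}
    if start:
        params["start"] = start
    return f"{base}?{urlencode(params)}"


def state_from_uri(uri):
    """Recover the application state (query and offset) from a URI."""
    params = parse_qs(urlparse(uri).query)
    return {
        "query": params["q"][0],
        "start": int(params.get("start", ["0"])[0]),
    }


first_page = results_page_uri("http://www.google.com/search", "jellyfish")
second_page = results_page_uri("http://www.google.com/search", "jellyfish", start=10)
state = state_from_uri(second_page)
```

Because everything the server needs is in the URI, the second page can be bookmarked, cached or emailed to someone else. What the URI cannot tell you is which question the user was asking: "what are the best ten?" or "do you have anything at all?"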
It does not matter that Microsoft may buy Yahoo–the acquisition is based on a flawed premise. Technology companies cannot operate like the GEs and General Motors of the world and serve as the be-all-end-all of technology. The New York Times today put the acquisition in the right context. Describing the business culture of Silicon Valley, they write:
The economist Joseph Alois Schumpeter had a name for this principle of capitalism: creative destruction. Perhaps nowhere does it play out more dramatically — and more rapidly — than in Silicon Valley, where innovation unleashes a force that creates and destroys, over and over.
Technology companies are susceptible to creatively destructive forces when they try to expand too far beyond their original mission. Technologies like computer programming can only be successful if they break problems into smaller pieces that individually solve only a single component of the larger goal. At the time of writing, a computer programming function is defined by the masses (Wikipedia) as “a portion of code within a larger program, which performs a specific task and can be relatively independent of the remaining code” (my emphasis). This principle of modularization at the most basic level of contemporary information technology is important to a technology organization’s business model.
Microsoft and Yahoo both fail so horribly at the world of search and Internet advertising because those problem domains lie at the heart of neither company’s core service: the operating system/desktop platform and the Internet portal. The reason Google so thoroughly dominates the world of search and Internet advertising is that this is its only core. Everything it does revolves around this core service and all of its activities support this model. The moral of the story is that you must choose your core, your identity and your raison d’être and you must choose it wisely because trying to be all things to all people is a futile exercise.
What does this mean for libraries? In the techie realm of libraries, an institution needs to determine what its core mission is and decide how it will define itself in a world of creative destruction. It will need to be able to clearly and succinctly articulate what those goals are to its affiliate institutions: universities or local governments. The library must not try to do everything; as the current computing paradigm of APIs and web services demonstrates, technology works when it is implemented singularly and exceptionally, but in a manner that is open and unafraid of sharing its data and services.
And finally, the modern library must not be afraid to get in the game and take a turn at trying to creatively destroy the old guard, lest it fall prey to the fate of the Yahoos of the world.
The New York Times has a short piece on a new Google service called Knol that sounds like it could have been conceived by librarians:
“We believe that knowing who wrote what will significantly help users make better use of web content,” wrote Udi Manber, vice president of engineering, on the official Google blog.
The service appears to be a wiki-style hosting service that puts a premium on identifying authorship.
23 2007
Modeling Things or Revealing Things
Posted by Steve in Cataloging/Classification, Technology
Karen Coyle has a great piece on Hierarchies vs. Relationships in bibliographic modeling. She points out that the point of the FRBR model is not so much the hierarchy that you get to model, but the relationships that you can reveal among things.
This is a keen insight in my view since it really begins to get at the fun stuff that the Googles, Amazons, etc are doing with data that libraries long to do with bibliographic data. Coyle starts to articulate something here that I have not been able to put my finger on: the way that FRBR is a huge step forward but still only has an eye toward an implementation rooted in the way libraries have traditionally done things.
My library right now has been in discussions about subject guides and how to best build and provide access to them. I have felt for some time now that it would be great to get out of a next-generation catalog a system that imparts the kind of knowledge our librarians and subject liaisons put into these projects. Coyle’s post renewed this thought by framing the new catalog model in terms of a “Knowledge Management system,” which to my mind is the true aim of a discovery system.
In the past when I have tried to express a hybrid of a next-generation catalog and a subject discovery tool, I have always framed it in terms of applying graph theory to bibliographic data. I think Coyle’s post helps me to understand this. It seems obvious to use subject terms and call number ranges as one type of edge/vertex for nodes which are bibliographic items. However, her discussion raises the possibility of a new set of different kinds of edge types: translations, abridgements, extensions, etc.
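Here is a rough sketch of what typed edges between bibliographic items could look like. It is only an illustration: the BibGraph class, the relation labels and the item identifiers are all made up for this example, and the edges are treated as undirected to keep it short.

```python
from collections import defaultdict


class BibGraph:
    """Bibliographic items as nodes, with a relation type on every edge."""

    def __init__(self):
        self._edges = defaultdict(list)

    def relate(self, item_a, relation, item_b):
        # Store the edge in both directions so either item can reveal the other.
        self._edges[item_a].append((relation, item_b))
        self._edges[item_b].append((relation, item_a))

    def related(self, item, relation=None):
        """Return neighbors of an item, optionally filtered by relation type."""
        return [
            other
            for rel, other in self._edges.get(item, [])
            if relation is None or rel == relation
        ]


# Hypothetical identifiers; a real system would use record IDs.
graph = BibGraph()
graph.relate("item_en", "translation", "item_fr")
graph.relate("item_en", "abridgement", "item_abridged")
graph.relate("item_en", "shared_subject", "item_other")

translations = graph.related("item_en", "translation")
everything = sorted(graph.related("item_en"))
```

Subject terms and call number ranges become just one more edge type alongside translations and abridgements, and the interface can choose which relationships to reveal.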
Unless Karen Coombs is writing about some other reference statistics tracking package that has an (until recently) undocumented dependency on Pear::DB, her blog post calls out one of the (numerous) failings of Libstats: Installation is difficult for a lot of people. I get a lot of questions from people who have trouble with mod_rewrite or don’t know DB is required or various other things.
I’ve had similar negative experiences with open-source software, and actually releasing something gave me a much better understanding of why things wind up like this.
A few years ago, our library decided to write a reference tracking system and pilot it at a few libraries across campus. Since I was, then, the only developer at our library, the task fell to me. Once the system had proven successful at Madison, I thought, “Hey, maybe other people would like this, too.” I got the OK from my boss to release the code under an open-source license.
This, it turns out, is trickier than it might seem. All of those steps I’d fumbled through to make the software run, I had to eliminate, or at least explain, to people installing this software on the servers they have on hand. Databases need to be created and populated with initial data. Web servers need to be configured. Did I want to provide a demo? Screenshots? Big software projects provide installation wizards, but writing those is a bunch of work, and from my boss’s perspective, the software was written and done, and I had other projects to work on.
Then, there were concerns over the quality of the code. There’s some ugly shit in there. Did I really want people looking at that, and pointing and laughing? What if there’s a security bug in the code that could compromise someone’s server? Even if it relies on server misconfiguration, I’d feel pretty lousy if my code got someone hacked. How will people find out about, obtain, and install patches? Seriously, I wondered, is it even worth the work it’s gonna take to release this code?
Finally, I decided that it was worth the work, and that I’d release it, warts and all, in the hopes that it would be useful to some people. In the time since then, I’ve realized that the motivations of an open-source developer are different from those of a commercial project manager. I don’t get any reward from wide adoption, except a warm fuzzy feeling inside and possibly bragging rights if I make something exceptionally neat.
The bottom line: There’s a large cost and a limited benefit to making an open-source project into an open-source product, and that work will never ever happen as long as the project is only used internally — it’s not needed.
Here’s the question, then: Is it better to release something half-baked, in the hopes that it will be useful, or to keep it purely internal and let someone else solve the problem?
(On the particular topic of not documenting the Pear::DB requirement: when Libstats was released, DB was part of the standard PHP install, so this wasn’t a common issue. Reworking the code to use Pear::MDB is the right option, but that’s nontrivial.)
Pubcookie is pretty neat. It lets you authenticate against a login server without ever personally seeing the user’s password — it’s all handled via clever web server modules, redirects, and the REMOTE_USER variable. But, when you go to build a web app with it, you’ll likely find yourself pining for session-based logins. Fortunately, it’s easy to build an OpenID service that’s backed by Pubcookie. Here’s how:
What you need
- A web server with working Pubcookie authentication.
- An OpenID server. I had good luck with PHP-OpenID, and I’ll be using their example server in this post.
Set up your identity URLs
OpenID identity URLs are what people enter in OpenID login boxes around the net. The pages they point to aren’t anything special — in the simplest case, they just need to have a link to your OpenID server (also called a ‘provider’). It’ll look like:
<link rel="openid.server" href="http://example.edu/op/server.php" />
I used Apache’s mod_rewrite such that all URLs of the format:
http://example.edu/id/<username>
Would be valid identity URLs, linking to an identity provider service.
Note: Your identity URLs don’t need to be served over HTTPS, and they must not be protected behind Pubcookie.
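The post handles this mapping with mod_rewrite and does not show the rewrite rules, so rather than guess at them, here is the same idea as a small Python sketch: any path of the form /id/<username> yields a page whose only job is to point at the provider. The function name and the validation rules are assumptions; the provider URL is the example one used in this post.

```python
from html import escape

PROVIDER_URL = "http://example.edu/op/server.php"  # example provider from the post
IDENTITY_PREFIX = "/id/"


def identity_page(path):
    """Return minimal identity-page HTML for /id/<username>, or None otherwise."""
    if not path.startswith(IDENTITY_PREFIX):
        return None
    username = path[len(IDENTITY_PREFIX):]
    # Assumed rule: a username is a single, non-empty path segment.
    if not username or "/" in username:
        return None
    return (
        "<html><head>"
        f'<link rel="openid.server" href="{escape(PROVIDER_URL, quote=True)}" />'
        f"<title>{escape(username)}</title>"
        "</head><body></body></html>"
    )


page = identity_page("/id/jdoe")
```

Note that the page contains nothing secret, which is why it can (and must) live outside Pubcookie.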
Set up the OpenID provider
Follow your package’s installation notes, and get one statically-defined identity URL working. Also test to make sure the other OpenID identity URLs you’re providing don’t work.
If you’re looking for a place to test URLs, try this OpenID test service. Your provider URL can’t be behind a firewall or protected by Pubcookie, because other web servers need to talk to it.
Make note of the name of the session key your OpenID library is using. By default, PHP-OpenID uses openid_server. You’ll need it in the next step.
Make Pubcookie set a session variable
Here’s the magic step. You need a script, protected by Pubcookie, that puts the value of REMOTE_USER into your session (remember, your provider can’t be behind Pubcookie) and redirects you to your OpenID provider’s login URL. Since no one can view this script without authenticating via Pubcookie, and this script is the only place this session variable can be set, you need to go through Pubcookie to set this variable.
I put this script in http://example.edu/op/pubcookie/index.php:
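<?php
// Use the same session name as the OpenID provider so both share one session.
session_name('openid_server');
session_start();
// Pubcookie has already authenticated this request; record who it was.
$_SESSION['pubcookie_user'] = $_SERVER['REMOTE_USER'];
header("Location: http://example.edu/op/server.php/login");
exit;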
Hack your OpenID provider to respect the session
Here, you want to find the code in which authentication is checked, and replace it with a check for the session variable you set above. In this example, I replaced action_login() in actions.php with:
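function action_login() {
    // Pubcookie has already vouched for this session, so log the user in.
    if (isset($_SESSION['pubcookie_user'])) {
        $info = getRequestInfo();
        $openid_url = "http://example.edu/id/".$_SESSION['pubcookie_user'];
        setLoggedInUser($openid_url);
        return doAuth($info);
    }
    else {
        // No Pubcookie session yet: send the visitor to the protected script.
        return login_pubcookie_render();
    }
}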
I also added login_pubcookie_render() to render/login.php — it simply uses redirect_render() to send visitors to the pubcookie-protected page. Anywhere else in the code you’re showing the login page, use login_pubcookie_render() instead.
Finally, you’ll want to do a check in the method that actually does the authentication to make sure the identity URL matches with the Pubcookie username — you don’t want people to use their own credentials to log in as someone else. In common.php, I added a check to the start of doAuth():
if ($req_url != $user) {
    return login_pubcookie_mismatch($user, $req_url);
}
And added a login_pubcookie_mismatch() method to login.php, which warns that their username and URL don’t match, and that they should fix that situation.
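The logic of that check, stripped of the PHP-OpenID plumbing, can be sketched in a few lines of Python. The function name is hypothetical and the identity URL prefix is the example one from this post; the extra guard against an empty username is my assumption about what a careful check should include, not something the PHP above does.

```python
IDENTITY_BASE = "http://example.edu/id/"  # identity URL prefix used in this post


def identity_matches_user(requested_url, pubcookie_user):
    """True only when the requested identity URL belongs to the Pubcookie user."""
    # Assumed guard: an empty or missing username never matches anything.
    if not pubcookie_user:
        return False
    return requested_url == IDENTITY_BASE + pubcookie_user


own_identity = identity_matches_user("http://example.edu/id/jdoe", "jdoe")
someone_else = identity_matches_user("http://example.edu/id/asmith", "jdoe")
```

If the check fails, the provider should refuse to assert the identity and show the mismatch warning instead.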
Log out of everything and give the OpenID test a try. It should redirect you to your Pubcookie login system, and from there, to a working ID.
9 2007
An alternate view on the catalog’s purpose
Posted by Steve in Collection Management, Technology
I have to strongly disagree with what I saw as Nate’s primary point in his last post, What I want from a catalog. First, he pointed out that, “Library catalogs, by definition, contain only your library’s stuff,” and went on to conclude that this “is the killing blow to any idea of catalog-as-research-tool.” The primary argument is that a library can never compete with the amount of data amassed by the likes of Google or Amazon or Worldcat.
I agree that it is futile to try to beat these companies at their own game. No single library will ever manage that. They have more data and they have something that might be better than all the other kinds: intentional data. They can build their interfaces based on how people vote with their wallets through purchasing data from Amazon or library holdings data at OCLC. They can follow the money and we cannot.
However, this is not to say that libraries, and academic libraries in particular, do not have a niche in the information market. It is crucial for library systems developers to understand that libraries build collections. We make deliberate, careful and researched choices about what goes into the collection. We don’t have all the data at our disposal precisely because we don’t have unlimited budgets, so if we are doing our jobs well, we are only selecting the good materials for our collections.
Libraries can build a research map with a next generation library catalog. A good collection is defined not simply by the fact that it contains multiple items, but by the fact that there is some cohesion among the items collected. One thing that I cannot understand is why people cannot look past the physical containers of information objects. For an information collection, the cohesion which makes it worthy of the effort needed to build and sustain it is not based on the fact that they are all physically available items at the researcher’s disposal. That would simply make it a collection. What makes the collection good is the fact that it represents both a breadth and depth of knowledge required to conduct research. Or simply put, it contains good information.
I imagine a research process that is more like a partnership of the big research tools (Amazon, Google, OCLC) with the local library’s online research tools. In his discussion of the way that Wikipedia functions as a probability-based system Chris Anderson wrote in The Long Tail, “Wikipedia should be the first source of information, not the last. It should be a site for information exploration, not the definitive source of the facts.”
My take on the current state of research is similar. In the beginning of the research process it is advantageous as Nate said to go to a source that is not limited by physical geography. However, I think there are efficiencies that can be gained if libraries can get involved in the later stages of the research process.
After finding one good item at Amazon, you are offered “more like this” because someone wants to make a buck by selling you two books rather than one. At a library, where the collection has presumably been carefully selected, if you find the one good book you have a greater chance that the “more like this” offerings will also be more good information.
If Amazon wants to make another buck, what is our motivation? In the university environment in particular, we participate in the original reputation economy. A university employee inherits status from the status of the university. The university’s reputation is based on the quality of the research and scholarship it produces. Thus if I want to improve my reputation as a librarian, I have every incentive to make sure my researchers are finding quality information that makes their academic work as sound as possible.
In essence, I want to select and then make available a great collection. While libraries have been doing a great job building the collection, we are only now beginning to see how much work still needs to be done building systems that showcase those collections.
9 2007
What I want from a catalog
Posted by nate in Cataloging/Classification, Technology
It’s been a while since I’ve thought about what, in my mind, electronic catalogs are supposed to do. Today, Steve sent me a link to a test version of a very elegant catalog app built with a fraction of our catalog data. It really brings the cataloging data (you know, that stuff that librarians worked so hard to create) to the forefront, and has a great “shelf browse” view. This (plus this OPAC survey posted to code4lib) got me thinking: what should our catalog be, really?
It’s easy to get all Web 2.0 starry-eyed about this, perhaps partly because our catalog has been so ghastly for so long. People talk about social recommendations, comments, tags, structured blogging, and so on. There are a few problems with going down this road, though:
- Other people are already doing this, well, and for free.
- The Information Superhighway is littered with the charred-out husks of failed social networks. (Did you know Amazon added tagging a year ago? Have you ever used it?)
- Library catalogs, by definition, contain only your library’s stuff.
The first two points might be surmountable (and are really the same thing anyhow), but the third is the killing blow to any idea of catalog-as-research-tool. Amazon has more data than you. Google Books has more data than you. Worldcat has more data than you. The thing you need to do your research may be at someone else’s library; this is why we have ILL, after all. Using the OPAC to do research means you’ll miss out on everything that’s not local. We can’t fix that. All of the social networking, “More about this book,” “More books like this,” and so on are all based on using the OPAC as a research tool. We just shouldn’t do that.
The place where our catalog can excel, the place where no one can compete, is in finding things already in our collection. Try using your Voyager-based catalog to find out where a particular book (or journal volume) is. Want extra credit? Try finding a NASA technical report. For some stuff, it’s nearly impossible to do, even for librarians. The number of times I’ve heard a librarian say “Well, I just know this is probably over here…” makes me want to scream. We’re using a catalog that indexes all of our millions of things so badly that our librarians often need to ask other librarians to help find things that are sitting on a shelf or in a file drawer.
It’s shameful.
So… I’m happy to wait on all of the Web 2.0 goodness until we’ve mastered the Web 1.0 thing.