Know Yourself First

Posted on February 3rd, 2008 by Steve. Filed in News, Technology.

It does not matter that Microsoft may buy Yahoo; the acquisition is based on a flawed premise. Technology companies cannot operate like the GEs and General Motors of the world and serve as the be-all and end-all of technology. The New York Times today put the acquisition in the right context. Describing the business culture of Silicon Valley, they write:

The economist Joseph Alois Schumpeter had a name for this principle of capitalism: creative destruction. Perhaps nowhere does it play out more dramatically — and more rapidly — than in Silicon Valley, where innovation unleashes a force that creates and destroys, over and over.

Technology companies are susceptible to creatively destructive forces when they try to expand too far beyond their original mission. Technologies like computer programming can only be successful if they break problems into smaller pieces that individually solve only a single component of the larger goal. At the time of writing, a computer programming function is defined by the masses (Wikipedia) as “a portion of code within a larger program, which performs a specific task and can be relatively independent of the remaining code” (my emphasis). This principle of modularization at the most basic level of contemporary information technology is important to a technology organization’s business model.

Microsoft and Yahoo both fail so badly at search and Internet advertising because those problem domains lie at the heart of neither company’s core service: the operating system/desktop platform and the Internet portal, respectively. Google so thoroughly dominates search and Internet advertising because that is its only core. Everything it does revolves around this core service and all of its activities support this model. The moral of the story is that you must choose your core, your identity and your raison d’être, and you must choose it wisely, because trying to be all things to all people is a futile exercise.

What does this mean for libraries? In the techie realm of libraries, an institution needs to determine what its core mission is and decide how it will define itself in a world of creative destruction. It will need to be able to clearly and succinctly articulate what those goals are to its affiliate institutions: universities or local governments. The library must not try to do everything; as the current computing paradigm of APIs and web services demonstrates, technology works when it is implemented singularly and exceptionally, but in a manner that is open and unafraid of sharing its data and services.

And finally, the modern library must not be afraid to get in the game and take a turn at trying to creatively destroy the old guard, lest it fall prey to the fate of the Yahoos of the world.

Google influenced by librarians?

Posted on December 17th, 2007 by Steve. Filed in Culture, Technology.

The New York Times has a short piece on a new Google service called Knol that sounds like it could have been conceived by librarians:

“We believe that knowing who wrote what will significantly help users make better use of web content,” wrote Udi Manber, vice president of engineering, on the official Google blog.

The service appears to be a wiki-style hosting service that puts a premium on identifying authorship.

Modeling Things or Revealing Things

Posted on November 23rd, 2007 by Steve. Filed in Cataloging/Classification, Technology.

Karen Coyle has a great piece on Hierarchies vs. Relationships in bibliographic modeling. She points out that the point of the FRBR model is not so much the hierarchy that you get to model, but the relationships that you can reveal among things.

This is a keen insight in my view since it really begins to get at the fun stuff that the Googles, Amazons, etc. are doing with data and that libraries long to do with bibliographic data. Coyle starts to articulate something here that I have not been able to put my finger on: the way that FRBR is a huge step forward but still only has an eye toward an implementation rooted in the way libraries have traditionally done things.

My library has been discussing subject guides and how best to build and provide access to them. I have felt for some time now that it would be great to get out of a next-generation catalog a system that imparts the kind of knowledge our librarians and subject liaisons put into these projects. Coyle’s post renewed this thought by framing the new catalog model in terms of a “Knowledge Management system,” which to my mind is the true aim of a discovery system.

In the past when I have tried to express a hybrid of a next-generation catalog and a subject discovery tool, I have always framed it in terms of applying graph theory to bibliographic data. I think Coyle’s post helps me to understand this. It seems obvious to use subject terms and call number ranges as one type of edge connecting nodes (vertices) that are bibliographic items. However, her discussion raises the possibility of a new set of edge types: translations, abridgements, extensions, etc.
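To make the idea concrete, here is a rough sketch of bibliographic items as nodes with typed edges between them. Everything in it is an assumption for illustration: the BibGraph class, the item IDs, and the edge-type names are made up, and none of it comes from FRBR or from any catalog system.

```php
<?php
// Hypothetical sketch: bibliographic items as nodes, typed relationships as edges.
// The class name, item IDs, and edge types are illustrative assumptions only.
class BibGraph
{
    // Maps item => edge type => list of related items.
    private $edges = array();

    // Record a directed, typed relationship between two items.
    public function addEdge($from, $type, $to)
    {
        $this->edges[$from][$type][] = $to;
    }

    // Return items related to $item, optionally restricted to one edge type.
    public function related($item, $type = null)
    {
        if (!isset($this->edges[$item])) {
            return array();
        }
        if ($type !== null) {
            return isset($this->edges[$item][$type]) ? $this->edges[$item][$type] : array();
        }
        $all = array();
        foreach ($this->edges[$item] as $targets) {
            $all = array_merge($all, $targets);
        }
        return array_values(array_unique($all));
    }
}

$g = new BibGraph();
$g->addEdge('book-1', 'shares_subject', 'book-2');       // the "obvious" edge type
$g->addEdge('book-1', 'translation', 'book-1-fr');       // relationship-style edges
$g->addEdge('book-1', 'abridgement', 'book-1-abridged');

print_r($g->related('book-1', 'translation'));
```

Keeping the edge type as part of the key means a discovery interface could ask for "everything related" or only for, say, translations, without changing how the items themselves are stored.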

More on this later…

I’ve been busted!

Posted on October 17th, 2007 by nate. Filed in Culture, Technology.

Unless Karen Coombs is writing about some other reference statistics tracking package that has an (until recently) undocumented dependency on Pear::DB, her blog post calls out one of the (numerous) failings of Libstats: Installation is difficult for a lot of people. I get a lot of questions from people who have trouble with mod_rewrite or don’t know DB is required or various other things.

I’ve had similar negative experiences with open-source software, and actually releasing something gave me a much better understanding of why things wind up like this.

A few years ago, our library decided to write a reference tracking system and pilot it at a few libraries across campus. Since I was, then, the only developer at our library, the task fell to me. Once the system had proven successful at Madison, I thought, “Hey, maybe other people would like this, too.” I got the OK from my boss to release the code under an open-source license.

This, it turns out, is trickier than it might seem. All of those steps I’d fumbled through to make the software run, I had to eliminate, or at least explain, to people installing this software on the servers they have on hand. Databases need to be created and populated with initial data. Web servers need to be configured. Did I want to provide a demo? Screenshots? Big software projects provide installation wizards, but writing those is a bunch of work, and from my boss’s perspective, the software was written and done, and I had other projects to work on.

Then, there were concerns over the quality of the code. There’s some ugly shit in there. Did I really want people looking at that, and pointing and laughing? What if there’s a security bug in the code that could compromise someone’s server? Even if it relies on server misconfiguration, I’d feel pretty lousy if my code got someone hacked. How will people find out about, obtain, and install patches? Seriously, I wondered, is it even worth the work it’s gonna take to release this code?

Finally, I decided that it was worth the work, and that I’d release it, warts and all, in the hopes that it would be useful to some people. In the time since then, I’ve realized that the motivations of an open-source developer are different from those of a commercial project manager. I don’t get any reward from wide adoption, except a warm fuzzy feeling inside and possibly bragging rights if I make something exceptionally neat.

The bottom line: There’s a large cost and a limited benefit to making an open-source project into an open-source product, and that work will never ever happen as long as the project is only used internally — it’s not needed.

Here’s the question, then: Is it better to release something half-baked, in the hopes that it will be useful, or to keep it purely internal and let someone else solve the problem?

(On the particular topic of not documenting the Pear::DB requirement: when Libstats was released, DB was part of the standard PHP install, so this wasn’t a common issue. Reworking the code to use Pear::MDB is the right option, but that’s nontrivial.)

How to implement OpenID with Pubcookie

Posted on August 7th, 2007 by nate. Filed in Technology.

Pubcookie is pretty neat. It lets you authenticate against a login server without ever personally seeing the user’s password — it’s all handled via clever web server modules, redirects, and the REMOTE_USER variable. But, when you go to build a web app with it, you’ll likely find yourself pining for session-based logins. Fortunately, it’s easy to build an OpenID service that’s backed by Pubcookie. Here’s how:

What you need

  1. A web server with working Pubcookie authentication.
  2. An OpenID server. I had good luck with PHP-OpenID, and I’ll be using their example server in this post.

Set up your identity URLs

OpenID identity URLs are what people enter in OpenID login boxes around the net. The pages they point to aren’t anything special — in the simplest case, they just need to have a link to your OpenID server (also called a ‘provider’). It’ll look like:

<link rel="openid.server" href="http://example.edu/op/server.php" />

I used Apache’s mod_rewrite so that all URLs of the format:

http://example.edu/id/<username>

would be valid identity URLs, each linking to the identity provider service.
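One way to serve those pages might look like the sketch below. It is untested against a real setup and rests on assumptions of mine: the script name id.php, the rewrite rule shown in the comment, and the render_identity_page() function are all hypothetical and are not part of PHP-OpenID or Pubcookie.

```php
<?php
// Hypothetical identity-page script (assumed name: id.php).
// Assumed .htaccess rule in the document root that routes /id/<username> here:
//   RewriteEngine On
//   RewriteRule ^id/([A-Za-z0-9_.-]+)$ /id.php?user=$1 [L]

// Build the minimal identity page: all it needs is a link to the provider.
function render_identity_page($username, $server_url)
{
    // Escape both values so a crafted username can't inject markup.
    $user = htmlspecialchars($username, ENT_QUOTES, 'UTF-8');
    $href = htmlspecialchars($server_url, ENT_QUOTES, 'UTF-8');
    return "<html><head>\n"
         . "<link rel=\"openid.server\" href=\"$href\" />\n"
         . "<title>OpenID identity page for $user</title>\n"
         . "</head><body>Identity page for $user</body></html>\n";
}

// When reached through the rewrite rule, print the page for that user.
if (isset($_GET['user'])) {
    echo render_identity_page($_GET['user'], 'http://example.edu/op/server.php');
}
```

The page itself carries no secrets, which is why it can sit outside Pubcookie: it only tells the relying site where the provider lives.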

Note: Your identity URLs don’t need to be served over HTTPS, and they must not be protected behind Pubcookie.

Set up the OpenID provider

Follow your package’s installation notes, and get one statically-defined identity URL working. Also confirm that the other identity URLs you’re serving don’t authenticate yet; at this stage only the statically-defined one should.

If you’re looking for a place to test URLs, try this OpenID test service. Your provider URL can’t be behind a firewall or protected by Pubcookie, because other web servers need to talk to it.

Make note of the name of the session key your OpenID library is using. By default, PHP-OpenID uses openid_server. You’ll need it in the next step.

Make Pubcookie set a session variable

Here’s the magic step. You need a script, protected by Pubcookie, that puts the value of REMOTE_USER into your session (remember, your provider can’t be behind Pubcookie) and redirects you to your OpenID provider’s login URL. Since no one can view this script without authenticating via Pubcookie, and this script is the only place this session variable can be set, you need to go through Pubcookie to set this variable.

I put this script in http://example.edu/op/pubcookie/index.php:

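<?php
// Use the same session name as the OpenID provider so both share one session.
session_name('openid_server');
session_start();
// Pubcookie has already authenticated the visitor; record who they are.
$_SESSION['pubcookie_user'] = $_SERVER['REMOTE_USER'];
header("Location: http://example.edu/op/server.php/login");
exit;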

Hack your OpenID provider to respect the session

Here, you want to find the code in which authentication is checked, and replace it with a check for the session variable you set above. In this example, I replaced action_login() in actions.php with:


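function action_login() {
    if (isset($_SESSION['pubcookie_user'])) {
        // Pubcookie has vouched for this visitor: log them in under their identity URL.
        $info = getRequestInfo();
        $openid_url = "http://example.edu/id/".$_SESSION['pubcookie_user'];
        setLoggedInUser($openid_url);
        return doAuth($info);
    }
    else {
        // No Pubcookie session yet: bounce to the Pubcookie-protected script.
        return login_pubcookie_render();
    }
}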

I also added login_pubcookie_render() to render/login.php — it simply uses redirect_render() to send visitors to the pubcookie-protected page. Anywhere else in the code you’re showing the login page, use login_pubcookie_render() instead.
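For reference, login_pubcookie_render() can be as small as the sketch below. The redirect_render() defined here is only a stand-in so the snippet runs by itself; the real function ships with the example server, and the return shape I give it here (an array of headers plus a body) is an assumption.

```php
<?php
// Stand-in for the example server's redirect_render(), defined only so this
// sketch runs by itself. The real function's return value may differ.
if (!function_exists('redirect_render')) {
    function redirect_render($redir_url)
    {
        $headers = array('HTTP/1.1 302 Found', 'Location: ' . $redir_url);
        $body = 'Redirecting to ' . $redir_url;
        return array($headers, $body);
    }
}

// Send anyone without a Pubcookie session to the Pubcookie-protected script,
// which sets the session variable and redirects back to the provider.
function login_pubcookie_render()
{
    return redirect_render('http://example.edu/op/pubcookie/index.php');
}
```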

Finally, you’ll want to do a check in the method that actually does the authentication to make sure the identity URL matches with the Pubcookie username — you don’t want people to use their own credentials to log in as someone else. In common.php, I added a check to the start of doAuth():

if ($req_url != $user) {
    return login_pubcookie_mismatch($user, $req_url);
}

I also added a login_pubcookie_mismatch() method to login.php, which warns visitors that their username and identity URL don’t match, and that they should fix that situation.
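A minimal sketch of what that mismatch page could look like follows. The wording of the message and the (headers, body) return shape are assumptions for illustration; match whatever your provider's other render functions actually return.

```php
<?php
// Hypothetical sketch of the mismatch page. The (headers, body) return shape
// is an assumption for illustration, not the library's documented API.
function login_pubcookie_mismatch($user, $req_url)
{
    $headers = array('Content-Type: text/html; charset=utf-8');
    // Escape both URLs before echoing them back to the visitor.
    $body = sprintf(
        "<p>You are logged in as %s, but you asked to sign in as %s. "
        . "Enter your own identity URL and try again.</p>",
        htmlspecialchars($user, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($req_url, ENT_QUOTES, 'UTF-8')
    );
    return array($headers, $body);
}
```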

Log out of everything and give the OpenID test a try. It should redirect you to your Pubcookie login system, and from there, to a working ID.

Libstats 1.0.4 - Security Release

Posted on June 13th, 2007 by nate. Filed in Culture.

Hi there,

Y’all will want to download & install the latest version of Libstats. This version fixes a security bug that will affect anyone running with register_globals on in their PHP setup.

Libstats 1.0.3 released

Posted on May 1st, 2007 by nate. Filed in Culture.

Yup, it’s time for that very occasional release of Libstats. Changes in this version:

  • On the report form, times are honored in the range fields
  • asked_at is now on the data dump report
  • question_time, question_weekday, etc now reflect when a question was asked, not when it was entered
  • ‘All Libraries’ is now an option for reports.

Enjoy!

An alternate view on the catalog’s purpose

Posted on April 9th, 2007 by Steve. Filed in Collection Management, Technology.

I have to strongly disagree with what I saw as Nate’s primary point in his last post, What I want from a catalog. First, he pointed out that, “Library catalogs, by definition, contain only your library’s stuff,” and went on to conclude that this “is the killing blow to any idea of catalog-as-research-tool.” The primary argument is that a library can never compete with the amount of data amassed by the likes of Google or Amazon or Worldcat.

I agree that it is futile to try to beat these companies at their own game. No single library will ever manage that. They have more data, and they have something that might be better than all the other kinds: intentional data. They can build their interfaces based on how people vote with their wallets through purchasing data from Amazon or library holdings data at OCLC. They can follow the money and we cannot.

However, this is not to say that libraries, and academic libraries in particular, do not have a niche in the information market. It is crucial for library systems developers to understand that libraries build collections. We make deliberate, careful and researched choices about what goes into the collection. We don’t have all the data at our disposal precisely because we don’t have unlimited budgets, so if we are doing our jobs well, we are only selecting the good materials for our collections.

Libraries can build a research map with a next generation library catalog. A good collection is defined not simply by the fact that it contains multiple items, but by the fact that there is some cohesion among the items collected. One thing that I cannot understand is why people cannot look past the physical containers of information objects. For an information collection, the cohesion which makes it worthy of the effort needed to build and sustain it is not based on the fact that its items are all physically available at the researcher’s disposal. That would simply make it a collection. What makes the collection good is the fact that it represents both the breadth and depth of knowledge required to conduct research. Or simply put, it contains good information.

I imagine a research process that is more like a partnership of the big research tools (Amazon, Google, OCLC) with the local library’s online research tools. In his discussion of the way that Wikipedia functions as a probability-based system, Chris Anderson wrote in The Long Tail, “Wikipedia should be the first source of information, not the last. It should be a site for information exploration, not the definitive source of the facts.”

My take on the current state of research is similar. In the beginning of the research process it is advantageous, as Nate said, to go to a source that is not limited by physical geography. However, I think there are efficiencies that can be gained if libraries can get involved in the later stages of the research process.

After finding one good item at Amazon, you are offered “more like this” because someone wants to make a buck by selling you two books rather than one. At a library, where the collection has presumably been carefully selected, if you find the one good book you have a greater chance that the “more like this” offerings will also be more good information.

If Amazon wants to make another buck, what is our motivation? In the university environment in particular, we participate in the original reputation economy. A university employee inherits status from the status of the university. The university’s reputation is based on the quality of the research and scholarship it produces. Thus if I want to improve my reputation as a librarian, I have every incentive to make sure my researchers are finding quality information that makes their academic work as sound as possible.

In essence, I want to select and then make available a great collection. While libraries have been doing a great job building the collection, we are only now beginning to see how much work still needs to be done building systems that showcase those collections.

What I want from a catalog

Posted on April 9th, 2007 by nate. Filed in Cataloging/Classification, Technology.
1 comment filed

It’s been a while since I’ve thought about what, in my mind, electronic catalogs are supposed to do. Today, Steve sent me a link to a test version of a very elegant catalog app built with a fraction of our catalog data. It really brings the cataloging data (you know, that stuff that librarians worked so hard to create) to the forefront, and has a great “shelf browse” view. This (plus this OPAC survey posted to code4lib) got me thinking: what should our catalog be, really?

It’s easy to get all Web 2.0 starry-eyed about this, perhaps partly because our catalog has been so ghastly for so long. People talk about social recommendations, comments, tags, structured blogging, and so on. There are a few problems with going down this road, though:

  1. Other people are already doing this, well, and for free.
  2. The Information Superhighway is littered with the charred-out husks of failed social networks. (Did you know Amazon added tagging a year ago? Have you ever used it?)
  3. Library catalogs, by definition, contain only your library’s stuff.

The first two points might be surmountable (and are really the same thing anyhow), but the third is the killing blow to any idea of catalog-as-research-tool. Amazon has more data than you. Google Books has more data than you. Worldcat has more data than you. The thing you need to do your research may be at someone else’s library; this is why we have ILL, after all. Using the OPAC to do research means you’ll miss out on everything that’s not local. We can’t fix that. All of the social networking, “More about this book,” “More books like this,” and so on are all based on using the OPAC as a research tool. We just shouldn’t do that.

The place where our catalog can excel, the place where no one can compete, is in finding things already in our collection. Try using your Voyager-based catalog to find out where a particular book (or journal volume) is. Want extra credit? Try finding a NASA technical report. For some stuff, it’s nearly impossible to do, even for librarians. The number of times I’ve heard a librarian say “Well, I just know this is probably over here…” makes me want to scream. We’re using a catalog that indexes all of our millions of things so badly that our librarians often need to ask other librarians to help find things that are sitting on a shelf or in a file drawer.

It’s shameful.

So… I’m happy to wait on all of the Web 2.0 goodness until we’ve mastered the Web 1.0 thing.

Generation G

Posted on March 15th, 2007 by Steve. Filed in Culture, Technology.
2 comments filed

In the past couple of weeks I have had casual discussions with colleagues about the surge of Google in the university sphere. For example, our library is involved with the Google book scanning project. There have been other discussions around one of Google’s latest service offerings to universities: email.

Getting out of the email business is a very attractive proposition. It is a costly piece of infrastructure to support and it requires talented people to do well. This diverts some of our best human capital and technology resources away from other areas that are more specific to the university’s teaching, research and outreach domains. If the privacy issues surrounding financial aid decisions and other sensitive data can be resolved when storing this information on privately owned servers, it is very tempting for a university to get out of the email game.

However, it is important to ask why a company like Google would want to get into the email business for universities. Here is a guess: having access to the email of university students offers a solution to the Generation X problem. To my mind, the lasting significance of the phrase “Generation X,” as well as its newer spinoffs “Generation Y” and “Millennials,” is its implications for business, advertising and marketing.

While the image of Kurt Cobain may constitute the tragic poster of Generation X both in his life and in his death, from the ‘follow the money’ perspective the phrase refers to a black hole in the advertising business. Advertising revenue goes in and yet no sales come out. Remember that Generation X, in part, was meant to describe a segment of the population, “twentysomethings,” that was unpredictable and therefore unreachable by marketers.

Children and teenagers are easy - just give them candy or some outlandish rebellious style. Middle aged folks are similarly easy - sell them big expensive things like sports cars or retirement packages. But people in their twenties have a lot of discretionary income from their first jobs and no major cash drains like mortgages and offspring. And yet marketers were failing to connect.

So let us take this discussion back to the original point: why is Google eager to take email over for universities? There are two significant pieces surrounding Google’s way of doing email:

  1. Google and similar companies do not make their money off of the immediate services. You don’t pay Google for searches you do at its website. Likewise, Google probably could not make a significant amount of money on selling an email service. How does Google make money? Advertising. They don’t make money off the service provided to an end user, email in this case. Instead they make money on the data they collect about end users during the service’s use.
  2. Google had a model for email vastly different from everyone else’s, until everyone else began to ape it. The motto for their email service is, don’t delete, archive. Interviews I have seen and read with the Google founders also discuss their amazement that advertising has not caught up to the technology available in the 20th and 21st centuries. By harvesting large quantities of data, be it web pages or 2500 megabytes of email, you can deliver advertising to a person that is smarter and more likely to produce results.

Imagine that you could learn about and understand the personal associations and cultural references of an educated young adult just as s/he was heading out into the world to collect those first paychecks. What would you need to do? Well, one strategy would be to harvest their preprofessional communications that relate to their studies. Then you might combine that with harvesting personal and social communications among their peer groups. And doing this for, say, 4 to 5 years would provide a nice robust data set.

Now I do believe that Google is not going to sell any personal identities to advertisers or anyone else because losing their customer base as a result of what would essentially be corporate identity theft would be detrimental to their bottom line. However, you could build some rather impressive anonymous marketing personas that would be worth their digital weight in gold to advertisers.

I don’t think that if universities get out of the email business it is necessarily a bad thing, but I do think we need to be cognizant of what future we are contributing to. This is a technology decision that should be discussed with some of our best and brightest minds not only in the IT departments on our campuses, but also in our philosophy/ethics, business and sociology departments.