learns_to build Academic Archive :: Part 1: Introduction, Concept, and Design

18 August, 2006

At this year's FOSCON, Amy Hoy issued a clarion call to the elite Ruby hackers in the room: help the newbies! With the spectacular recent growth of Ruby and, especially, Rails, there's a great and growing need for educational resources and infrastructure to help newcomers get acclimatized.

Since then, I've had a few ideas that might help. The first I blogged a while ago: The How-I-Learned-Ruby Quiz. The second one is more ambitious. I'd like to introduce it here.

One of the most useful experiences for me in the process of learning Rails was working side by side with a more advanced coder on a real project, all the way from the first design sketches through the deployment of a working app. While the Agile Book tries to provide a version of it, this is an experience that is almost wholly unavailable to newbies. Most beginners are stuck trying to puzzle their way through with reference books, source code, and blogged code snippets. While these are sufficient for the experienced coder simply trying to pick up the new hip language, they just don't get the job done for true programming newcomers.

I therefore propose to provide some simulacrum of that experience here on this blog. I've got a project I want to build in Rails and, as I'm doing so, I'll try to give you a view over my shoulder. I'll write about the practicalities and the philosophy behind each step. Eventually, I hope to make the code I write available through a public repository so you can follow along and even help out (as you'll see shortly, the app is public-minded and will, eventually, be open source).

To complete this introduction, I'm going to take you through all the steps I've gone through so far: everything up to actually writing code. First, I'll lay out the concept for the app, summarizing its purpose and aims. Then, I'll talk about design. In many ways the hardest and most interesting part of building a web app, design is the process by which we translate the real world inhabited by the app into abstracted models and relationships we can represent in code. I'll start with the basic screens I imagined when first thinking about how to make this website and then proceed through the first two iterations of "arrows and boxes" I've come up with so far.

Concept: Academic Archive [1]

Scholars of every rank and level regularly research and write papers which never see publication. Whether written by undergraduates or tenured professors, by amateur local enthusiasts or internationally renowned experts, these papers represent a great wealth of research, insight, and argument which remains inaccessible to the wider community of scholars as well as the interested public at large.

Academic Archive seeks to provide a platform for publishing these papers on the web in order to make them universally accessible. The Archive will accept submissions from anyone regardless of qualifications. The Archive's back end will allow a network of volunteers to undertake cursory screening of submitted papers for plagiarism and to ensure that they meet a basic level of quality. The Archive's public website will organize and index these papers for convenient search and browsing.

Design

I don't know about other programmers, but when I'm first brainstorming about an idea for a new web app, the part of it that I can picture in my head is the screens. I can't necessarily see the specific style of how they look, but I can kind of get a sense of what different roles they'll have to play. It's like imagining your dream house. You might not know what color you'll paint it when it's done, but you know you want a hot tub, a racquetball court, a formal dining room, etc.

Anyway, here are the screens I first pictured for Academic Archive:

Author Upload Page
This will be where the whole process starts. Users will come here to upload their papers, so obviously it will need a file upload form. Most users probably have their paper in Microsoft Word format, so we'll need to make some decisions about how to prepare papers for the web. In the long run we'd like to be able to accept actual Word files and process them into HTML ourselves. This would make things easiest for the users and allow us to ensure the best markup for our articles. In a first iteration, though, our goal is to eliminate or put off as much complexity as we possibly can, so we'll probably only accept papers already formatted as HTML (so this page will probably also have some instructions on how to convert Word files).
Editor Approval Page
If you're one of our volunteer editors, this will be your home base. You can see a queue of articles awaiting your approval. From here, you can read each of the articles and approve or reject them for publication.
Index of Papers
This is really a whole section of the site dedicated to browsing through lists of the published papers and searching for information contained in them. It's got a front page with either the most recent papers or some other selections. If you're a visitor, it takes you from arriving at the site all the way up to the point of clicking on a paper to read it. I haven't given this section too much thought since it is the least specific to the particular problem we're trying to address. Lots of other sites on the web present browsing and searching interfaces and, at least to start out with, I'll probably steal one of those that I think is good.
Individual Paper View
This part couldn't be simpler. The papers come in as HTML and all we've got to do is remember where we stored those files and point the readers at them. Additionally, we may want to provide some location for people to discuss each paper, but again that's not within the scope of what we're doing right now. We're just trying to find the simplest site we can build that will solve the problem as we set it out in the Concept. Other feature ideas are great and we'll try to keep them in mind as they come up so we don't make any design decisions that rule them out, but right now they are a dangerous distraction from getting the app onto a solid foundation, so we'll put them aside.

Now that we've done some basic thinking about the types of things our app needs to do, it's finally time to start thinking about how it's going to do them. That means it's time to "model our domain". Domain modeling is an incredibly deep subject and there are an endless number of books on it. In fact, I'm reading the one I hear is the best right now. In a nutshell, domain modeling is the process of building up in code a representation of the parts of the real world pertinent to your problem. The idea is to install in your program abstractions of the people and things you're working with (in our case authors, editors, papers, etc.) and to tie them together into the proper relationships. It's kind of a hard process to get a grasp on, but it will quickly become much clearer if we start work on our specific case.

I made my first design sketch at a pub while waiting for a friend to perform. The Concept was brand new and I was psyched about it. I was drinking a beer. Here's what I sketched in my moleskine:

Academic Archive first design sketch

The first thing I drew, and the part I was most confident about, was the author-authorship-paper trio of models. I was confident about this idea because I stole it. (A word about the notation here: a word represents a model, simultaneously a particular type of thing or person we're trying to represent as well as an actual class in our code. The lines represent relationships between them, and the stars mark a "many" relationship on one side, about which more in a minute.)

These three models are trying to represent the idea that an author "owns" a paper. That is, that an author has_many papers and a paper belongs_to one author (when I mean these relationships in their technical Rails/relational database sense, I'll write them with underscores like this, as they'll appear when they are actual Rails methods. One of the things that's so nice about Rails is the way its natural syntax lets you kind of slide gradually into code from natural language).
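To make the two sides of that relationship concrete, here's a minimal sketch in plain Ruby, with no Rails and no database. In Rails itself you'd declare this with `has_many :papers` on the author and `belongs_to :author` on the paper. The class layout and the names in the example are my own assumptions for illustration, not code from the app.

```ruby
# Plain-Ruby sketch (no Rails, no database) of the two sides of the relationship.
class Author
  attr_reader :name, :papers

  def initialize(name)
    @name = name
    @papers = [] # "has_many" side: an author holds a collection of papers
  end
end

class Paper
  attr_reader :title, :author

  def initialize(title, author)
    @title = title
    @author = author      # "belongs_to" side: a paper points at one author
    author.papers << self # keep both directions in sync
  end
end

ada = Author.new("Ada") # hypothetical author name
Paper.new("First Paper", ada)
Paper.new("Second Paper", ada)

puts ada.papers.map(&:title).inspect
puts ada.papers.first.author.name
```

The thing to notice is the asymmetry: the "many" side is a collection, the "belongs" side is a single reference.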

So where does authorship come in? Authorship is called a "join model" because it mediates the connection between authors and papers. Instead of asserting that an author has_and_belongs_to_many papers, we'll say that an author has_many authorships and has_many papers through authorships. Join models are helpful in a number of ways. How? Well, there were a couple of things wrong with the author-paper relationship we set out above. First of all, what if a paper has more than one author? In order to model this we've got to say that a paper has_and_belongs_to_many authors and vice versa, which, in the code, means adding a lot of difficulty to the average case just to handle some complexity which shows up only in exceptional cases (papers with multiple authors). Never a good idea. Secondly, a join model lets us assign attributes to the relationship between the two things that it joins. So, with the authorship model, we could capture the idea that two authors don't have the same status on a paper, e.g. that one is a research assistant or something. That would be a very hard situation to handle with a normal has_and_belongs_to_many relationship.
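Here's a rough plain-Ruby sketch of what a join model buys us, again without Rails. In Rails the author side would read `has_many :authorships` plus `has_many :papers, :through => :authorships`. The `role` attribute and the example names are assumptions I'm making up to show an attribute hanging on the relationship itself.

```ruby
# Plain-Ruby sketch of a join model: Authorship sits between Author and Paper.
class Author
  attr_reader :name, :authorships

  def initialize(name)
    @name = name
    @authorships = []
  end

  # "has_many papers through authorships"
  def papers
    authorships.map(&:paper)
  end
end

class Paper
  attr_reader :title, :authorships

  def initialize(title)
    @title = title
    @authorships = []
  end

  # The same join, read from the other side
  def authors
    authorships.map(&:author)
  end
end

class Authorship
  attr_reader :author, :paper, :role

  # role is a hypothetical attribute that lives on the relationship itself
  def initialize(author, paper, role = "author")
    @author = author
    @paper = paper
    @role = role
    author.authorships << self
    paper.authorships << self
  end
end

lead = Author.new("Lead Author")
assistant = Author.new("Research Assistant")
paper = Paper.new("Joint Paper")
Authorship.new(lead, paper, "primary author")
Authorship.new(assistant, paper, "research assistant")

puts paper.authors.map(&:name).inspect
paper.authorships.each { |a| puts "#{a.author.name}: #{a.role}" }
```

A paper with one author and a paper with five look exactly the same here; only the number of authorships changes.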

What else is going on in this sketch? Well, down the right we've got some lines connecting author to person and from thence to editor with the inscription "STI?" nearby. The idea there was the following: we've got authors and we've got editors. But really, both of those are just different types of people. Single Table Inheritance (or STI) is a pattern that allows you to capture multiple roles that might be played by a certain type of entity while retaining the attributes that are always common across those roles. A common example might be people in a company: the same person could simultaneously be a manager, an employee, and a member of a committee. No matter what role they played they would still probably have a name, contact info, etc., so by keeping them in a common table a lot of work could be saved. I won't go into too much depth on STI. When you get to the next design iteration you'll see why I decided to abandon it (or at least postpone thinking about it until later).
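For the curious, here's a toy illustration of the STI idea in plain Ruby. An array of hashes stands in for the single database table. Rails does store the class name in a `type` column on the shared table; everything else here (the `to_row` and `from_row` helpers, the column names, the example people) is a simplification I've invented for the sketch. Note that in this sketch each row carries exactly one type.

```ruby
# Toy sketch of Single Table Inheritance: one "table" of rows, with a
# "type" column that records which subclass each row should become.
class Person
  attr_reader :name, :email

  def initialize(name, email)
    @name = name
    @email = email
  end

  # Flatten any kind of person into the same set of columns
  def to_row
    { "type" => self.class.name, "name" => name, "email" => email }
  end

  # Rebuild the right subclass from the row's "type" column
  def self.from_row(row)
    Object.const_get(row["type"]).new(row["name"], row["email"])
  end
end

class Author < Person; end
class Editor < Person; end

# Hypothetical people; the array plays the part of the shared table
people_table = [
  Author.new("Ann", "ann@example.com").to_row,
  Editor.new("Ed", "ed@example.com").to_row
]

people_table.each do |row|
  person = Person.from_row(row)
  puts "#{person.name} is stored as #{row['type']} and loaded as #{person.class}"
end
```

The shared attributes (name, email) are defined once on Person, which is the work-saving part of the pattern.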

How far along was I after making this sketch? I had a pretty good list of the nouns (author, paper, editor, category) but not a very clear sense of their relationships. I knew the authorship join was a good idea because I had seen other, smarter people model that exact same situation before. But I didn't really know how papers got into categories or what relationship editors had with them. You can see me brainstorming some ideas for how to solve these problems in the notes at the bottom of the page. I was trying to figure out how papers get assigned to editors for approval. I came up with the vague notion that papers get assigned to editors through categories, writing that papers "can be approved or not, etc. in many different categories." Though only beginning to come into focus, this idea turned out to be the key to unraveling the whole problem. But to get there, I needed help. So I brought in Chris.

On a lunch break from work one day, sitting at an outside table on Morrison between 10th and 11th eating mezzas at a Lebanese restaurant, I pulled out my notebook and started telling Chris about my design for Academic Archive. Very quickly he asked a number of highly clarifying questions and helped me tease out a much more robust design. Here's the sketch I made that day:

Academic Archive second design sketch (with Chris)

The author-authorship-paper relationship is there, but now there are a couple of whole new concepts on the board: approval and editorship. The main idea here is how categorization would work. In plain language, a paper could be submitted for approval in any number of different categories. It would then gain membership in each category by gaining approval from each category's editor. So, for example, I might submit my thesis, It's Not Just Academic: The Academy of Motion Picture Arts and Sciences, in both Art History and Film Studies. It would then be subject to approval by two different editors and could end up published in both categories, one, or neither depending on what each of the editors thought of it.

How does the new modeling capture this concept of the paper-category-editor relationship? It does so with two overlapping join model relationships. First we added approval, which stands between papers and categories. A paper has_many approvals and has_many categories through approvals (and likewise vice versa for categories). As in the example of authorship, the presence of this join model gives us an opportunity to hang attributes on the relationship. Here, we'll likely want to keep track of which editor issued the approval and when it took place. Actually, if you look at the diagram, that attribute will take the form of a full-on relationship. An approval belongs_to an editor. And, in fact, approvals will join papers all the way to editors as well as to categories. This will make it a breeze to figure out all the papers an editor has approved. And, thinking about it a little more, the approval life cycle will likely be the spot where most of the action on the Editor Approval Page will take place. Not to get too deep into implementation details, but I can imagine a scenario where creating a new approval assigns a paper to its editor, who then marks it as approved or rejected. We'll have to think this through more precisely at a later point, but it's probably good news that this structure seems so rich even at this early stage.

The one other thing to note before we move on to editorship is the fact that editors have a relationship to papers that is separate from categories. This seems like a good thing since it's easy to imagine a situation where category editors come and go over time. Keeping those relationships separate will mean that we can keep an accurate record of which editor actually approved a paper for a category rather than only knowing the current editor of the paper's category. Without this separation it would be really easy to lose track of the simple factoid: who approved this paper?
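Here's a speculative plain-Ruby sketch of the approval join as described so far: it connects a paper to a category, remembers which editor made the call, and carries its own status and timestamp. The status values, method names, and editor names are all my own guesses at this stage, not a settled implementation.

```ruby
# Plain-Ruby sketch of Approval as a join between Paper and Category that
# also records the deciding Editor directly, independent of the category.
Category = Struct.new(:name)
Editor = Struct.new(:name)

class Approval
  attr_reader :paper, :category, :editor, :status, :decided_at

  def initialize(paper, category, editor)
    @paper = paper
    @category = category
    @editor = editor
    @status = :pending # hypothetical life cycle: pending, approved, rejected
    @decided_at = nil
  end

  def approve!(at = Time.now)
    @status = :approved
    @decided_at = at
  end

  def reject!(at = Time.now)
    @status = :rejected
    @decided_at = at
  end

  def approved?
    status == :approved
  end
end

class Paper
  attr_reader :title, :approvals

  def initialize(title)
    @title = title
    @approvals = []
  end

  # Creating an approval is what assigns the paper to an editor
  def submit_to(category, editor)
    approval = Approval.new(self, category, editor)
    @approvals << approval
    approval
  end

  # "has_many categories through approvals", limited to approved ones
  def published_categories
    approvals.select(&:approved?).map(&:category)
  end
end

# All the papers a given editor has approved, whatever the category
def papers_approved_by(editor, papers)
  papers.select do |paper|
    paper.approvals.any? { |a| a.approved? && a.editor.equal?(editor) }
  end
end

thesis = Paper.new("It's Not Just Academic")
art_editor = Editor.new("Art Editor")   # hypothetical editors
film_editor = Editor.new("Film Editor")
thesis.submit_to(Category.new("Art History"), art_editor).reject!
thesis.submit_to(Category.new("Film Studies"), film_editor).approve!

puts thesis.published_categories.map(&:name).inspect
puts papers_approved_by(film_editor, [thesis]).map(&:title).inspect
```

Because the editor lives on the approval itself, swapping in a new category editor later wouldn't rewrite the record of who approved what.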

Our second join model, editorship, looks a lot like authorship. It's how editors gain the ability to approve papers for categories. It will be really easy to list the categories for which an editor has approval power -- handy when you're trying to build the Editor Approval Page.
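As a quick plain-Ruby sketch of editorship, assuming it mirrors authorship: an editor has many editorships and, through them, many categories. The `can_approve_in?` helper is a hypothetical convenience of mine, not something from the design sketch.

```ruby
# Plain-Ruby sketch of Editorship as a join between Editor and Category.
Category = Struct.new(:name)

class Editor
  attr_reader :name, :editorships

  def initialize(name)
    @name = name
    @editorships = []
  end

  # "has_many categories through editorships"
  def categories
    editorships.map(&:category)
  end

  # Hypothetical helper: does this editor have approval power here?
  def can_approve_in?(category)
    categories.include?(category)
  end
end

class Editorship
  attr_reader :editor, :category

  def initialize(editor, category)
    @editor = editor
    @category = category
    editor.editorships << self
  end
end

art = Category.new("Art History")
film = Category.new("Film Studies")
music = Category.new("Musicology")

editor = Editor.new("Some Editor") # hypothetical editor
Editorship.new(editor, art)
Editorship.new(editor, film)

puts editor.categories.map(&:name).inspect
puts editor.can_approve_in?(music)
```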

What outstanding questions does this design leave us with? Well, beyond editor and author there's no larger sense of a person or a user. As we thought the first time through, at some point we're going to have to provide a common ancestor for editors and authors. It will be the place we'll stick the user's personal details as well as their authentication information. That stuff is easy to leave out for the moment since it's both totally unrelated to the specifics of our domain and easy to bolt on later with a third-party plugin like acts_as_authenticated. More substantially, we'll be using the idea of a user to make sure that an editor isn't assigned to approve a paper she authored. That's an important rule to capture and I'm pretty sure our design makes it possible, but in the name of limiting complexity, I'm going to stick a pin in this issue and come back to it later once things are more real.
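Just to convince myself the rule is expressible, here's one way it might look in plain Ruby, assuming authors and editors each point back at a shared user. The `User` struct and the `assignable?` helper are hypothetical; the real thing will depend on how the common ancestor ends up being modeled.

```ruby
# Plain-Ruby sketch: an editor may not be assigned a paper she authored.
# Assumes (hypothetically) that authors and editors share a User identity.
User = Struct.new(:name)
Author = Struct.new(:user)
Editor = Struct.new(:user)
Paper = Struct.new(:title, :authors)

# True only when none of the paper's authors is the same user as the editor
def assignable?(editor, paper)
  paper.authors.none? { |author| author.user.equal?(editor.user) }
end

pat = User.new("Pat") # hypothetical users
sam = User.new("Sam")

paper = Paper.new("Pat's Paper", [Author.new(pat)])

puts assignable?(Editor.new(pat), paper) # Pat wrote it
puts assignable?(Editor.new(sam), paper) # Sam did not
```

The comparison is on user identity rather than on the author or editor objects, which is exactly why some shared notion of a user is needed.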

The other big issue we have is that there's no place to do admin-type activities: how does an editor gain permission to add an editorship in a new category? Who is allowed to create categories? Again, we're aware that at some point we'll probably need an admin model which is related to our concept of users in the same way that our author and editor models are. Again, this will probably all be done with Single Table Inheritance. And again, we're going to put it off for a little while.

Well, it's starting to look like we have a pretty good feeling for how to build this app. Enough to get started anyway. While these unknowns we just reviewed might be disconcerting, in my experience they're pretty par for the course. What we need to do to clarify them is to actually build some part of the app for real. If we do that right, it will provide a concrete basis for our thought process about these continuing questions and may even point us in the right direction for a solution.

What part of the app should we build first? When I look at the diagram, I want to start with the paper-approval-category relationship and, specifically, with papers. That's the one zone that involves only objects that are unique to our domain; there are no accounts or plugins or anything else external involved at all. Plus, the heart of this app is taking uploaded papers and putting them on the web. If we get that right, everything else should fall into place. Or at least, that's what we hope.

  1. This idea is, and the resulting project will be, a collaboration with Jem, my cousin and one of my partners in MFDZ.
