Muller's World

Mike Muller's Homepage

ODB Tutorial

This tutorial will walk you through basic use of the ODB storage layer and the higher level model module. It assumes a UNIXish system and basic proficiency with the UNIX environment.

Basic Storage

Let's start by creating a simple program to manage our e-mail groups.

    #!/usr/bin/python
    # store this as "maillist", and "chmod +x maillist"
    
    import sys
    import odb
    
    # get the database
    store = odb.Store('maildb')
    
    # get a "map" table
    groups = store.getMap('Groups')
    
    # get the command line arguments, decide what to do based on the first 
    # command.
    args = sys.argv[1:]
    cmd = args.pop(0)
    
    if cmd == 'putgroup':
    
        # get the group and all of the members, store them in the Groups table
        group = args.pop(0)
        members = args
        groups.put(group, members)
    
    elif cmd == 'rmgroup':
        
        # delete the group
        group = args.pop(0)
        try:
            groups.delete(group)
        except KeyError:
            print 'Group %s does not exist' % group
    
    elif cmd == 'lsgroup':
        
        # list the members of the group
        group = args.pop(0)
        members = groups.get(group)
        if members is None:
            print 'Group %s does not exist' % group
        else:
            for member in members:
                print member

The first step is to identify the database:

    # get the database
    store = odb.Store('maildb')  

The "Store" class is an ODB database. The string passed into it is the filesystem path where the database stores its files. This directory will be created if it doesn't exist, so there is no separate construction process for it.

If you run:

    $ maillist putgroup friends joey@yahoo.com frodo@middleearth.org

You should now have a "maildb" subdirectory in your current working directory.

It's probably a bad idea to store your databases on network filesystems: ODB currently does only fcntl-style locking, which isn't universally supported by network filesystems, so data corruption is possible if multiple clients use the database simultaneously.
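For readers unfamiliar with fcntl-style locking, here is a minimal sketch of the general technique using Python's standard fcntl module. This is not ODB's own locking code; the helper name with_exclusive_lock and the lock file name are made up for illustration, and the example assumes a UNIXish system.

```python
import fcntl
import os
import tempfile

def with_exclusive_lock(path, action):
    """Run action() while holding an exclusive advisory lock on path.

    Hypothetical helper for illustration only; not part of ODB.
    """
    # open the lock file, creating it if necessary
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # block until we hold an exclusive lock on the whole file
        fcntl.lockf(fd, fcntl.LOCK_EX)
        try:
            return action()
        finally:
            # release the lock even if action() raised
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)

# "lock" is an arbitrary file name chosen for this example
lock_path = os.path.join(tempfile.mkdtemp(), 'lock')
result = with_exclusive_lock(lock_path, lambda: 'done')
```

These locks are advisory: they only protect against other processes that take the same lock, and they only work if the underlying filesystem honors them, which is exactly what can go wrong on a network filesystem.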

We can list and delete elements in our database:

    $ maillist lsgroup friends
    joey@yahoo.com
    frodo@middleearth.org
    $ maillist rmgroup friends

Transactions

We often want to perform a set of database actions atomically: so that all of the actions either succeed or fail together.

As an example, let's say we wanted to keep a separate table of which groups every e-mail address was a member of. When adding a group, we could just do a series of puts:

    # our "groups per member" table
    groupsPerMember = store.getMap('groupsPerMember')

    # ... portions omitted ...

    # new code to store a member
    groups.put(group, members)
    for member in members:
    
        # see if the member is already in a group
        memberGroups = groupsPerMember.get(member, [])
        if memberGroups is None:
            memberGroups = []
        
        # add the group to the list of groups for the member
        memberGroups.append(group)
        groupsPerMember.put(member, memberGroups)

But if we get an error after writing the group, but before writing all of the members, our database is in an inconsistent state: some of the members will have an incorrect list of groups. We just can't have that.
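To make the failure mode concrete, here is a small sketch that uses plain Python dictionaries as stand-ins for the two tables (no ODB involved). The function name put_group_unsafely and its fail_after parameter are hypothetical, and exist only to simulate an error partway through the update.

```python
# plain dictionaries standing in for the two tables
groups = {}
groups_per_member = {}

def put_group_unsafely(group, members, fail_after=None):
    """Update both "tables" without a transaction.

    fail_after is a hypothetical knob: raise an error once that many
    members have been processed.
    """
    groups[group] = list(members)
    for count, member in enumerate(members):
        if count == fail_after:
            raise RuntimeError('simulated failure')
        groups_per_member.setdefault(member, []).append(group)

try:
    put_group_unsafely('friends',
                       ['joey@yahoo.com', 'frodo@middleearth.org'],
                       fail_after=1)
except RuntimeError:
    pass

# the group was written with both members, but only the first member's
# group list was updated before the failure
```

After the simulated failure, the group lists both members, while the per-member table only knows about the first one.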

The way to avoid this is to enclose the entire update in a transaction:

    
    txn = store.startTxn()
    try:
    
        # store the group
        groups.put(group, members)
        for member in members:
        
            # see if the member is already in a group
            memberGroups = groupsPerMember.get(member, [])
            if memberGroups is None:
                memberGroups = []
            
            # add the group to the list of groups for the member
            memberGroups.append(group)
            groupsPerMember.put(member, memberGroups)
        
        # commit the transaction and clear it so that we don't abort it.
        txn.commit()
        txn = None
    finally:
        
        # if the transaction wasn't fully committed (and set to None) abort 
        # it, rolling back all changes.
        if txn: txn.abort()

We can do something similar for rmgroup: this is left as an exercise for the reader.

We've seen that the transaction pattern looks like this:

    txn = store.startTxn()
    try:
        ... do something ...
        
        txn.commit()
        txn = None
    finally:
        if txn: txn.abort()

This is a lot of syntax for something so common as defining a transaction. In order to make this a little less work, ODB provides a transaction function decorator that makes any function run in its own transaction:

    from odb import txnFunc
    
    @txnFunc
    def storeInTwoTables(obj):
        byName.put(obj.name, obj)
        byId.put(obj.id, obj)
        return obj.name
    
    # store the object in both tables in a single transaction
    name = storeInTwoTables(obj)

The code above is equivalent to the whole "try ... finally" wrapper above, but is much less verbose. Note that arguments and return values are respected.
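If you are curious how a decorator like this can work, here is a rough sketch built on a toy in-memory store. It is not ODB's implementation: ToyStore, txn_func and put_pair are hypothetical names, and the toy "transaction" is just a shallow snapshot of a dictionary.

```python
import functools

class ToyStore(object):
    """Hypothetical in-memory stand-in for a store; not part of ODB."""

    def __init__(self):
        self.data = {}
        self._snapshot = None

    def start_txn(self):
        # remember the state at the start of the transaction
        # (a shallow copy is enough for this example)
        self._snapshot = dict(self.data)

    def commit(self):
        self._snapshot = None

    def abort(self):
        # roll back to the remembered state
        self.data = self._snapshot
        self._snapshot = None

def txn_func(store):
    """Build a decorator that runs a function in its own transaction."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            store.start_txn()
            committed = False
            try:
                result = func(*args, **kwargs)
                store.commit()
                committed = True
                return result
            finally:
                # abort if the function raised before we could commit
                if not committed:
                    store.abort()
        return wrapper
    return decorate

store = ToyStore()

@txn_func(store)
def put_pair(key, value, fail=False):
    store.data[key] = value
    if fail:
        raise ValueError('simulated failure')
    return key

ok = put_pair('a', 1)
try:
    put_pair('b', 2, fail=True)
except ValueError:
    pass
```

The first call commits and returns its value; the second raises, so its write is rolled back. The wrapper is the same "try ... finally" pattern, written once.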

If you need to access the transaction from within the function, you can use the getTxn() method:

    @txnFunc
    def doSomething():
        # get the current transaction
        txn = store.getTxn()
        ...

Transaction Annotations

You may want to store additional information in a transaction, like a timestamp or the id of the user that committed the transaction. Transaction annotations allow you to do this:

    txn = store.startTxn()
    try:
        groups.put('clowns', ['bozo@circus.com', 'guffaw@barnumbailey.com'])

        txn.annotations['user'] = 'mmuller'
        txn.annotations['comment'] = 'adding group "clowns"'
        txn.commit()
        txn = None
    finally:
        if txn: txn.abort()

We can do the same thing from within a decorated transaction function as follows:

    @txnFunc
    def storeGroup(group, members):
        groups.put(group, members)
        txn = store.getTxn()
        txn.annotations['user'] = 'mmuller'
        txn.annotations['comment'] = 'adding group %s' % repr(group)
    
    storeGroup('clowns', ['bozo@circus.com', 'guffaw@barnumbailey.com'])

If we dump our transaction log using the dbDump utility, we can now see:

    $ dbDump maildb/log.000000001
    Txn {
      Annotations {
        'comment': 'adding group "clowns"'
        'user': 'mmuller'
      }
      _ReplaceAction {
        key = 'clowns'
        oldVal = None
        name = 'Groups'
        val = ['bozo@circus.com', 'guffaw@barnumbailey.com']
        gotOldVal = False
      }
    }

You'll currently have to do some digging into ODB's internals if you want to access the transaction logs programmatically. Hopefully, better support for this sort of thing will some day be added to the API.

Inspecting our Database

ODB provides some tools to allow us to look into the database without writing python code. In particular, "odbq" lets us perform arbitrary queries on the database:

    $ odbq -d maildb groupsPerMember/*
    frodo@middleearth.org ['friends']
    joey@yahoo.com ['friends', 'comrades']
    lester@nester.com ['comrades']

As you can see, while you were working on the rmgroup code, I added a "comrades" list along with my "friends" list :-). The query I used above was "groupsPerMember/*", which selects all keys in the "groupsPerMember" table.

We'll discuss odbq further later on when talking about the higher level "model" features.

Changes to the database are stored in transaction log files. "dbDump" can be used to view the contents of the transaction log:

    $ dbDump maildb/log.000000001
    [big transaction log dump omitted]

Sequence Tables

So far, we've only used "Map" tables, which are the most commonly used table type. But occasionally, you want to store data sequentially. For example, let's say we wanted to implement a persistent message queue:

    class Queue:
        
        def __init__(self):
            self.q = store.getSequence('queue')
        
        def add(self, message):
            # add the message to the end of the table
            self.q.append(message)
        
        def get(self):
            # pop the first message off the queue.
            return self.q.pop(0)

If we want to make sure that we were able to successfully process the message before removing it from the queue, we could wrap the processing in a transaction:

    q = Queue()
    txn = store.startTxn()
    try:
        msg = q.get()
        # simulate a failure while processing the message
        raise Exception('error processing the message!')
        txn.commit()
        txn = None
    finally:
        if txn: txn.abort()

Sequence tables are implemented using a special form of a btree which stores child counts instead of keys. You can expect O(log n) insertions and lookups.
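The idea of a tree that stores child counts instead of keys can be sketched in a few lines. The Leaf and Branch classes below are hypothetical and only show lookup by index; they say nothing about ODB's actual node layout, balancing, or on-disk format.

```python
class Leaf(object):
    """Holds a short run of elements."""

    def __init__(self, items):
        self.items = list(items)

    def count(self):
        return len(self.items)

    def lookup(self, index):
        return self.items[index]

class Branch(object):
    """Holds children plus the number of elements under each child."""

    def __init__(self, children):
        self.children = list(children)
        # per-child element counts take the place of keys
        self.counts = [child.count() for child in self.children]

    def count(self):
        return sum(self.counts)

    def lookup(self, index):
        # skip whole children whose counts fall before the index
        for child, n in zip(self.children, self.counts):
            if index < n:
                return child.lookup(index)
            index -= n
        raise IndexError('sequence index out of range')

tree = Branch([
    Branch([Leaf('ab'), Leaf('cde')]),
    Branch([Leaf('f'), Leaf('ghi')]),
])
```

Looking up position 4 skips the two-element leaf by its count alone and lands in the second leaf, without ever visiting elements 0 through 3 individually. With a bounded number of children per node, that descent is what gives the logarithmic cost.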

Cursors

ODB lets you iterate over ranges of values in both map and sequence tables using cursors.

To list all of the groups in our Groups table, we could do this:

    for name, members in groups.cursor():
        print '%s: %s' % (name, members)

The cursor() method returns an iterator over its table. For a Map table, the iterator yields key/value pairs. For a Sequence table, it merely yields elements:

    queue = store.getSequence('queue')
    for elem in queue.cursor():
        print elem

Cursors can be positioned using setToFirst(), setToLast() and setToKey(). To print out all groups whose names start with the letter "f":

    cur = groups.cursor()
    cur.setToKey('f')
    for name, members in cur:
        if not name.startswith('f'):
            break
        print name, members

Note that map table keys are sorted lexically.
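The reason the loop above can stop at the first non-matching name is that sorted keys put every key with a given prefix in one contiguous run. Here is the same idea on a plain sorted list, using the standard bisect module; keys_with_prefix is a hypothetical helper, not an ODB function.

```python
import bisect

def keys_with_prefix(sorted_keys, prefix):
    """Return the contiguous run of keys that start with prefix."""
    # find the first key that is >= prefix
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            # sorted order means no later key can match either
            break
        matches.append(key)
    return matches

keys = sorted(['comrades', 'family', 'clowns', 'friends', 'guests'])
f_keys = keys_with_prefix(keys, 'f')
```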

setToKey() defaults to a partial match - it finds the first key beginning with the specified substring. You can also find an exact match:

    cur.setToKey('foo', exact = True)

The setToKey() method also works on sequence tables; in this case the key is the index:

    # start iteration at position 10
    cur.setToKey(10)
    for elem in cur:
        print elem

It is often useful to traverse a table in reverse, so the cursor supports a reverse() method which returns a reverse cursor at the same position:

    # go backwards through our group list
    cur = groups.cursor()
    cur.setToLast()
    cur = cur.reverse()
    for group, members in cur:
        print group, members

The semantics of cursors are conceptually the same as those of sequence indices in Python: a cursor can be thought of as pointing to the space between two elements. So, for example, for any non-empty table:

    cur = table.cursor()
    first = cur.next()
    print first == cur.reverse().next() # always prints "True" (assuming
                                        # comparison works as expected)

This could yield unexpected results when dealing with reverse iterators:

    cur = groups.cursor().reverse()
    cur.setToKey('f')
    print cur.next() # prints the (group, members) _before_ the first group 
                     # starting with "f"
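The "space between two elements" model is easy to try out on a plain list. ToyCursor below is a hypothetical class written for this illustration; it mimics the behavior described above but is not ODB's cursor.

```python
class ToyCursor(object):
    """Hypothetical list cursor; pos is the gap just before items[pos]."""

    def __init__(self, items, pos=0, backwards=False):
        self.items = items
        self.pos = pos
        self.backwards = backwards

    def next(self):
        if self.backwards:
            # step over the element to the left of the gap
            if self.pos == 0:
                raise StopIteration
            self.pos -= 1
            return self.items[self.pos]
        # step over the element to the right of the gap
        if self.pos == len(self.items):
            raise StopIteration
        item = self.items[self.pos]
        self.pos += 1
        return item

    def reverse(self):
        # same gap, opposite direction
        return ToyCursor(self.items, self.pos, not self.backwards)

names = ['clowns', 'comrades', 'family', 'friends']

# moving forward and then backward crosses the same element twice
cur = ToyCursor(names)
first = cur.next()
same = cur.reverse().next()

# a reverse cursor sitting in the gap just before 'family'
back = ToyCursor(names, pos=2, backwards=True)
before_f = back.next()
```

Here first and same are both 'clowns', and before_f is 'comrades': the reverse cursor returns the element on the other side of the gap, which is the one before the first name starting with "f".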

The Model Module

Everything up to this point has been focused on the storage API, which lets you store and retrieve objects from tables. This is all well and good, but in most applications there is a need for some higher level features. We typically want to be able to do things like compose keys from attribute values, or define indices on tables. This is where the "model" module comes in.

Let's say we wanted to improve upon our mailing list example so that groups could contain hundreds of users. We might want to define a few objects like this:

    class Group:
        "Groups have an id and a description."
        
        def __init__(self, id, desc):
            self.id = id
            self.desc = desc
    
    class Member:
        "Members have a group and an e-mail address."
        
        def __init__(self, group, email):
            self.group = group
            self.email = email

The kinds of things that we want to do are the same as in the previous example:

  • Look up all members of a group

  • Look up all groups that an address is a member of.

We can do this by making our objects Model objects and defining schemas for them. Schemas define mappings between objects and their tables and indices. They allow you to specify a list of object attributes to be used to define keys for the tables and indices.

To make use of these features, we would rewrite our classes as follows:

    from odb.model import Model, Schema, WILD

    class Group(Model):
        "Groups have an id and a description."
        
        # groups are in table "Group", the key is the group id.
        _schema = Schema('Group', ('id',))
        
        def __init__(self, id, desc):
            self.id = id
            self.desc = desc

        def iterAllMembers(self):
            # iterate over the list of Member objects whose key starts with the 
            # group id
            for key, member in Member.select(self.id, WILD):
                yield member

        def addMember(self, email):
            Member(self.id, email).put()
        
        def removeMember(self, email):
            Member.get(self.id, email).delete()
                
    class Member(Model):
        "Members have a group and an e-mail address."

        # members are in table "Member" (where the key is the group and e-mail) 
        # and are also indexed by email and group.
        _schema = Schema('Member', ('group', 'email'),
                         indeces = {'Email': ('email', 'group')}
                         )
                
        def __init__(self, group, email):
            self.group = group
            self.email = email

Index and table keys are defined by a tuple of attribute names that are used to compose the key values. These are sorted lexically in map tables, so if we want to be able to do something like iterating over all of the members of the group, we want the key to begin with the group id so that all of the members of the group can be found in a contiguous range within the table or index.
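To see why the order of attributes in the key matters, here is a sketch that uses sorted Python tuples as stand-ins for composed keys. How ODB actually encodes composite keys is not shown here; the tuple representation and the select_group helper are assumptions made for illustration.

```python
import bisect

# (group, email) tuples standing in for composed keys
members = sorted([
    ('friends', 'joey@yahoo.com'),
    ('comrades', 'lester@nester.com'),
    ('friends', 'frodo@middleearth.org'),
    ('comrades', 'joey@yahoo.com'),
])

def select_group(sorted_keys, group):
    """Return the contiguous run of keys whose first part is group."""
    # (group,) sorts just before every (group, email) pair
    start = bisect.bisect_left(sorted_keys, (group,))
    result = []
    for key in sorted_keys[start:]:
        if key[0] != group:
            break
        result.append(key)
    return result

friends = select_group(members, 'friends')
```

Because the group comes first, all of a group's members sit next to each other. Finding all groups for one e-mail address in this ordering would mean scanning everything, which is why the schema above also defines an index keyed by ('email', 'group').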

Keys must be unique. If you attempt to store an object with a key that is identical to that of another object, either in the main table or in any of the indices, you will get a KeyCollisionError.

The Model.select() and Model.get() methods allow you to retrieve objects by key. Model.get() is used to retrieve a specific object given its key. Model.select() is a generator that allows you to iterate over a range of key/value pairs.

Both methods take similar keyword and positional arguments: the positional arguments are values for each attribute in the key. So, for example, the primary key in our main table is ('group', 'email'), so to look up an object using get we say "Member.get(group, email)".

TODO

  • storage cursors

  • odbq queries on model tables

  • filers and backups