DataFrame Data Structure

Rica implements a new type, called DataFrame, for storing data that is structured, relational, and/or tabular. A DataFrame is a data structure that is indexed with respect to rows and associative with respect to columns, much like a table in a typical SQL database.

The Rica DataFrame type implements clojure.lang.Associative and clojure.lang.Indexed so that instances of DataFrame can be manipulated using the built-in functions that Clojure uses to manipulate its other data structures.
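To illustrate what implementing these two interfaces involves, here is a minimal sketch of a toy type. TinyFrame and its fields are hypothetical and are not Rica's actual implementation: a plain vector of column names stands in for the schema, and rows are returned as array-maps rather than ordered-maps.

```clojure
;; Hypothetical toy type, NOT Rica's implementation.
;; columns: map of column name -> vector of values
;; order:   vector of column names (a stand-in for the schema)
(deftype TinyFrame [columns order]
  clojure.lang.ILookup
  (valAt [_ k] (get columns k))
  (valAt [_ k not-found] (get columns k not-found))

  clojure.lang.Associative
  (containsKey [_ k] (contains? columns k))
  (entryAt [_ k] (find columns k))
  (assoc [_ k v]
    ;; Adding a new column appends its name to the column order.
    (TinyFrame. (assoc columns k (vec v))
                (if (contains? columns k) order (conj order k))))

  clojure.lang.IPersistentCollection
  (cons [this kv] (assoc this (first kv) (second kv)))
  (empty [_] (TinyFrame. {} []))
  (equiv [this other] (identical? this other))

  clojure.lang.Seqable
  ;; A seq of the frame is a seq of its rows.
  (seq [this] (seq (map (fn [i] (nth this i)) (range (count this)))))

  clojure.lang.Indexed
  ;; The row count is the length of the first column.
  (count [_] (if (empty? order) 0 (count (get columns (first order)))))
  ;; A row is a map from column name to the value at index i,
  ;; with keys in column order.
  (nth [_ i]
    (apply array-map (mapcat (fn [k] [k (nth (get columns k) i)]) order)))
  (nth [this i not-found]
    (if (< -1 i (count this)) (nth this i) not-found)))

(def tiny (TinyFrame. {:id [0 1 2] :username ["alice" "bob" "eddie"]}
                      [:id :username]))

(get tiny :username)   ; ["alice" "bob" "eddie"]
(contains? tiny :id)   ; true
(nth tiny 1)           ; {:id 1, :username "bob"}
```

Because the type implements the interfaces that get, contains?, assoc, nth, and seq dispatch on, no frame-specific versions of those functions are needed.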

DataFrame Fields

The DataFrame type has two fields: a map of columns, and a schema.

A DataFrame’s map of columns is what holds the data stored in the data structure. There is one key-value pair for each column of data, where the key denotes the column name and the value is a Column type. A Column is Rica’s implementation of a typed vector, and is described more in the next section.

A DataFrame’s schema is used to denote 1) the order of columns and 2) the data type (class) of each column. The mechanism used to implement DataFrame schemas is an ordered-map.
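The two ideas above, a typed vector and a schema recording column order and class, can be sketched with plain vectors. The helpers typed-conj and infer-schema are hypothetical names used only for illustration and are not part of Rica's API; the sketch also assumes that nil is accepted in any column and that a vector of pairs is an adequate stand-in for an ordered-map.

```clojure
;; Hypothetical helpers, NOT part of Rica's API.

(defn column-class
  "Returns the class of the first non-nil element of col, or nil if there is none."
  [col]
  (some->> col (remove nil?) first class))

(defn typed-conj
  "Appends x to the vector col. Only nil, or a value that is an instance of
  the column's class, is accepted; anything else throws."
  [col x]
  (let [col-class (column-class col)]
    (if (or (nil? x) (nil? col-class) (instance? col-class x))
      (conj col x)
      (throw (IllegalArgumentException.
              (str "Cannot add " (class x) " to " col-class " column."))))))

(defn infer-schema
  "Builds a vector of [column-name class] pairs. The position of each pair
  records the column order; the second element records the column's class."
  [names columns]
  (mapv (fn [k] [k (column-class (get columns k))]) names))

(typed-conj ["alice" "bob"] "robert")
; ["alice" "bob" "robert"]

(infer-schema [:id :flag] {:id [0 1] :flag [\a nil]})
; [[:id java.lang.Long] [:flag java.lang.Character]]
```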

DataFrame Columns

When operating on the columns of a DataFrame, things feel a lot like working with a map.

(require '[rica.core :refer [create-data-frame]])

(def users
  (create-data-frame :id [0 1 2]
                     :username ["alice" "bob" "eddie"]
                     :public-profile [false false true]))

(get users :username)
; #rica/Column ["alice" "bob" "eddie"]

(conj (get users :username) 5)
; Exception Cannot add class java.lang.Long to class java.lang.String column.

(conj (get users :username) "robert")
; #rica/Column ["alice" "bob" "eddie" "robert"]

(assoc users :column-with-nils [\a nil \c])
; #rica/DataFrame <:id(class java.lang.Long) :username(class java.lang.String) :public-profile(class java.lang.Boolean) :column-with-nils(class java.lang.Character) >

(contains? users :foo)
; false

(contains? users :id)
; true

DataFrame Rows

Although a DataFrame does not store data on a per-row basis, it is often necessary to treat a DataFrame as an indexed data structure of rows.

The typical Clojure functions that operate on indexed data structures (e.g. nth, first, etc.) return a single row in the form of an ordered-map.

(second users)
; #ordered/map ([:id 1] [:username "bob"] [:public-profile false])

(nth users 2)
; #ordered/map ([:id 2] [:username "eddie"] [:public-profile true])

An entire DataFrame can be converted into a sequence of rows with seq.

(seq users)
; (#ordered/map ([:id 0] [:username "alice"] [:public-profile false]) #ordered/map ([:id 1] [:username "bob"] [:public-profile false]) #ordered/map ([:id 2] [:username "eddie"] [:public-profile true]))

(filter #(not (:public-profile %)) (seq users))
; (#ordered/map ([:id 0] [:username "alice"] [:public-profile false]) #ordered/map ([:id 1] [:username "bob"] [:public-profile false]))

The data-frame API in rica.core provides functions to turn sequences of rows back into a DataFrame.
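This section does not name those functions, so the following is only a plain-Clojure sketch of the underlying transformation from rows back to columns. The function rows->columns is a hypothetical name, not a rica.core function, and it assumes every row has the same keys in the same order.

```clojure
;; Hypothetical helper, NOT a rica.core function.
(defn rows->columns
  "Turns a sequence of row maps into a vector of [column-name values] pairs.
  The column order is taken from the key order of the first row."
  [rows]
  (let [names (keys (first rows))]
    (mapv (fn [k] [k (mapv #(get % k) rows)]) names)))

(rows->columns [(array-map :id 0 :username "alice")
                (array-map :id 1 :username "bob")])
; [[:id [0 1]] [:username ["alice" "bob"]]]
```

Since create-data-frame is shown above taking alternating column names and vectors, the pairs could presumably be flattened and passed to it with apply, though that usage is an inference from the earlier example rather than documented behavior.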