« whats the most important web2 acquisition to date? | Main | Memewars »

Sorting a data.frame in R

I frequently find myself having to re-order rows of a data.frame based on the levels of an ordered factor in R.

For example, I want to take this data.frame:

	  product store sales
	1       a    s1    12
	2       b    s1    24
	3       a    s2    32
	4       c    s2    12
	5       a    s3     9
	6       b    s3     2
	7       c    s3    29

And sort it so that the sales data from the stores with the most sales occur first:

	  product store sales
	3       a    s2    32
	4       c    s2    12
	5       a    s3     9
	6       b    s3     2
	7       c    s3    29
	1       a    s1    12
	2       b    s1    24

I keep forgetting the exact semantics of how its done and Google never offers any assistance on the topic, so here is a quick post to get it down once and for all, both for my own benefit and the greater good.

First we need some data:

   productSalesByStore = data.frame(
      product = c('a', 'b', 'a', 'c', 'a', 'b', 'c'),
      store = c('s1', 's1', 's2', 's2', 's3', 's3', 's3'),
      sales = c(12, 24, 32, 12, 9, 2, 29)
      )

Now construct a sorted summary of sales by store:

   storeSalesSummary =
      aggregate(
         productSalesByStore$sales,
         list(store = productSalesByStore$store),
         sum)
   storeSalesSummary =
      storeSalesSummary[ 
         order(storeSalesSummary$x, decreasing=TRUE), 
         ]

storeSalesSummary should look like this:

	  store  x
	2    s2 44
	3    s3 40
	1    s1 36

Use that summary data to construct an ordered factor of store names:

   storesBySales =
      ordered(
         storeSalesSummary$store,
         levels=storeSalesSummary$store
         )

storesBySales is now an ordered factor that looks like this:

	[1] s2 s3 s1
	Levels: s2 < s3 < s1

Re-construct productSalesByStore$store so that it is an ordered factor with the same levels as storesBySales

   productSalesByStore$store =
      ordered(productSalesByStore$store, levels=storesBySales)

Note that neither the contents nor the order of productSalesByStore has changed (yet). Just the datatype of the store column.

Finally, we use the implicit ordering of store to generate an explicit permutation of productSalesByStore so that we can sort the rows in a stable manner:

   productSalesByStore = 
      productSalesByStore[ order(productSalesByStore$store), ]

And we are done!

ap.

Comments

A possible shorter way is

do.call('rbind', by(productSalesByStore, productSalesByStore$store, function(x) x[sort.list(x$sales, dec=TRUE),]))

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)