What `R` you? (R dataframe vs python dataframe)

Recap

Previously in this series, we discovered the equivalent python data structures for the following R data structures:

  1. vectors
  2. lists
  3. arrays/matrixes

In this post, we will look at translating R data frames into python. We will also compare and contrast data frames in R and python.

R data frame is a python

Pretty straight forward, a R data frame is a python data frame. We will use an in built data frame, OrchardSprays, for our illustration. This dataset is from a study, which examined the relationship on the uptake of sugar solutions by bees when different concentrations of pesticide were added to the syrup.

OrchardSprays has four variables:

  1. rowpos: The row position of the solution

  2. colpos: The column position of the solution

  3. treatment: The type of orchard spray added to the solution

  4. decrease: The decrease in volume of the solution at the end of the study which indicates the amount of syrup consumed by the bees.

library(tidyverse)
library(reticulate)
py_run_string("import pandas as pd")

OrchardSprays %>% select(rowpos, colpos, treatment, decrease) %>% head()
##   rowpos colpos treatment decrease
## 1      1      1         D       57
## 2      2      1         E       95
## 3      3      1         B        8
## 4      4      1         H       69
## 5      5      1         G       92
## 6      6      1         F       90

Let’s check the data structure of OrchardSprays in R.

class(OrchardSprays)
## [1] "data.frame"

And now let is check the data structure when it is translated to python.

r_to_py(OrchardSprays) %>% class()
## [1] "pandas.core.frame.DataFrame"       
## [2] "pandas.core.generic.NDFrame"       
## [3] "pandas.core.base.PandasObject"     
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"   
## [6] "python.builtin.object"

Data frames in python are governed by the pandas package.

List columns

Occasionally, columns in a data frame contain lists instead of atomic vectors. These data frames are also known as nested data frames. Let us see how python handles a nested data frame from R. First, we will create a nested data frame. Variables “rowpos”, “colpos”, “decrease” will be nested inside the type of treatment. The nested variables appear as a list column called “data”.

Orchard_nest<- OrchardSprays %>% nest(-treatment)

knitr::kable(Orchard_nest)
treatment data
D list(decrease = c(57, 36, 22, 51, 28, 27, 20, 39), rowpos = c(1, 4, 5, 8, 2, 6, 7, 3), colpos = c(1, 2, 3, 4, 5, 6, 7, 8))
E list(decrease = c(95, 51, 39, 114, 43, 47, 61, 55), rowpos = c(2, 5, 4, 3, 1, 7, 8, 6), colpos = c(1, 2, 3, 4, 5, 6, 7, 8))
B list(decrease = c(8, 6, 4, 10, 7, 4, 8, 14), rowpos = c(3, 2, 8, 7, 6, 5, 1, 4), colpos = c(1, 2, 3, 4, 5, 6, 7, 8))
H list(decrease = c(69, 127, 72, 130, 81, 76, 81, 86), rowpos = c(4, 3, 2, 1, 7, 8, 6, 5), colpos = c(1, 2, 3, 4, 5, 6, 7, 8))
G list(decrease = c(92, 71, 72, 24, 60, 77, 72, 80), rowpos = c(5, 8, 7, 6, 3, 4, 2, 1), colpos = c(1, 2, 3, 4, 5, 6, 7, 8))
F list(decrease = c(90, 69, 87, 20, 71, 44, 57, 114), rowpos = c(6, 7, 1, 5, 8, 3, 4, 2), colpos = c(1, 2, 3, 4, 5, 6, 7, 8))
C list(decrease = c(15, 84, 16, 9, 17, 29, 13, 19), rowpos = c(7, 1, 6, 4, 5, 2, 3, 8), colpos = c(1, 2, 3, 4, 5, 6, 7, 8))
A list(decrease = c(2, 2, 5, 4, 5, 12, 4, 3), rowpos = c(8, 6, 3, 2, 4, 1, 5, 7), colpos = c(1, 2, 3, 4, 5, 6, 7, 8))

A nested data frame is still a R data frame.

class(Orchard_nest)
## [1] "data.frame"

Hence, it should be treated as a python data frame when translated.

r_to_py(Orchard_nest) %>% class()
## [1] "pandas.core.frame.DataFrame"       
## [2] "pandas.core.generic.NDFrame"       
## [3] "pandas.core.base.PandasObject"     
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"   
## [6] "python.builtin.object"

The dimension of a nested data frame in R and the translated data frame is the same.

dim(Orchard_nest)
## [1] 8 2
r_to_py(Orchard_nest) %>% dim()
## [1] 8 2

Inspecting the internal structure of the nested data frame, we can see for the column “data” there are nested variables within it .

str(Orchard_nest)
## 'data.frame':    8 obs. of  2 variables:
##  $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1
##  $ data     :List of 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  57 36 22 51 28 27 20 39
##   .. ..$ rowpos  : num  1 4 5 8 2 6 7 3
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  95 51 39 114 43 47 61 55
##   .. ..$ rowpos  : num  2 5 4 3 1 7 8 6
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  8 6 4 10 7 4 8 14
##   .. ..$ rowpos  : num  3 2 8 7 6 5 1 4
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  69 127 72 130 81 76 81 86
##   .. ..$ rowpos  : num  4 3 2 1 7 8 6 5
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  92 71 72 24 60 77 72 80
##   .. ..$ rowpos  : num  5 8 7 6 3 4 2 1
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  90 69 87 20 71 44 57 114
##   .. ..$ rowpos  : num  6 7 1 5 8 3 4 2
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  15 84 16 9 17 29 13 19
##   .. ..$ rowpos  : num  7 1 6 4 5 2 3 8
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8
##   ..$ :'data.frame': 8 obs. of  3 variables:
##   .. ..$ decrease: num  2 2 5 4 5 12 4 3
##   .. ..$ rowpos  : num  8 6 3 2 4 1 5 7
##   .. ..$ colpos  : num  1 2 3 4 5 6 7 8

However, when inspecting the structure of the translated nested data frame, the list column appears as an ordinary python column.

r.Orchard_nest.info(verbose=True) #ran using {python} code chunk. unable to run in {r} code chunk with py_eval nor py_run_string
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 8 entries, 0 to 7
## Data columns (total 2 columns):
## treatment    8 non-null category
## data         8 non-null object
## dtypes: category(1), object(1)
## memory usage: 584.0+ bytes

Also, when you print a nested data frame from R in python, the list column appears partially printed with misalignment.

py_eval("r.Orchard_nest") 
##   treatment                                               data
## 0         D     decrease  rowpos  colpos
## 0      57.0     1....
## 1         E     decrease  rowpos  colpos
## 0      95.0     2....
## 2         B     decrease  rowpos  colpos
## 0       8.0     3....
## 3         H     decrease  rowpos  colpos
## 0      69.0     4....
## 4         G     decrease  rowpos  colpos
## 0      92.0     5....
## 5         F     decrease  rowpos  colpos
## 0      90.0     6....
## 6         C     decrease  rowpos  colpos
## 0      15.0     7....
## 7         A     decrease  rowpos  colpos
## 0       2.0     8....

Tibbles

Tibbles are the modern cousin of traditional R data frames. Let us see how they are treated in python. First, we will convert a R data frame into a tibble. We will use another built in data frame in R, LifeCycleSavings. This dataset is about the savings ratio of different countries from 1960-1970.

LifeCycleSavings %>% head()
##              sr pop15 pop75     dpi ddpi
## Australia 11.43 29.35  2.87 2329.68 2.87
## Austria   12.07 23.32  4.41 1507.99 3.93
## Belgium   13.17 23.80  4.43 2108.47 3.82
## Bolivia    5.75 41.89  1.67  189.13 0.22
## Brazil    12.88 42.19  0.83  728.47 4.56
## Canada     8.79 31.72  2.85 2982.88 2.43

The output tbl_df indicates the structure is a tibble.

LCS_tibble<-as_tibble(LifeCycleSavings)
class(LCS_tibble)
## [1] "tbl_df"     "tbl"        "data.frame"

One thing to note about tibbles is that they do not have row names thus the row names were dropped when the data frame were converted into a tibble.

LCS_tibble %>% head()
## # A tibble: 6 x 5
##      sr pop15 pop75   dpi  ddpi
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11.4   29.4  2.87 2330.  2.87
## 2 12.1   23.3  4.41 1508.  3.93
## 3 13.2   23.8  4.43 2108.  3.82
## 4  5.75  41.9  1.67  189.  0.22
## 5 12.9   42.2  0.83  728.  4.56
## 6  8.79  31.7  2.85 2983.  2.43

Now let’s see how python recognizes a tibble.

r_to_py(LCS_tibble) %>% class()
## [1] "pandas.core.frame.DataFrame"       
## [2] "pandas.core.generic.NDFrame"       
## [3] "pandas.core.base.PandasObject"     
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"   
## [6] "python.builtin.object"

The tibble is translated into a python data frame. Let us check that there are no glitches when python translates a tibble into a python dataframe.

py_eval("r.LCS_tibble.head()")
##       sr  pop15  pop75      dpi  ddpi
## 0  11.43  29.35   2.87  2329.68  2.87
## 1  12.07  23.32   4.41  1507.99  3.93
## 2  13.17  23.80   4.43  2108.47  3.82
## 3   5.75  41.89   1.67   189.13  0.22
## 4  12.88  42.19   0.83   728.47  4.56

The translated tibble prints the same as a R tibble.


Differences between R and python data frames (row names and indexes)

Intro

If the data frame in R has row names it will be maintained in the python data frame.

LifeCycleSavings %>% head() %>% row.names()
## [1] "Australia" "Austria"   "Belgium"   "Bolivia"   "Brazil"    "Canada"

Just that in python row names are termed as indexes.

py_eval("r.LifeCycleSavings.head().index")
## Index(['Australia', 'Austria', 'Belgium', 'Bolivia', 'Brazil'], dtype='object')

Creating row names and indexes

There are R data frames without row names such as this pre-installed data frame, presidential. This dataset is about the terms of the eleven USA presidents from Eisenhower to Obama.

presidential %>% head()
## # A tibble: 6 x 4
##   name       start      end        party     
##   <chr>      <date>     <date>     <chr>     
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy    1961-01-20 1963-11-22 Democratic
## 3 Johnson    1963-11-22 1969-01-20 Democratic
## 4 Nixon      1969-01-20 1974-08-09 Republican
## 5 Ford       1974-08-09 1977-01-20 Republican
## 6 Carter     1977-01-20 1981-01-20 Democratic

Occasionally, you may decide to create row names for your data frame using your own set of values or values from a variable from the data frame. Let us designate the values from the “start” column to be the row names.

presidential_names<-presidential %>%  mutate(start_term=start) %>%    column_to_rownames(var="start") %>% select(name, start=start_term, everything())

presidential_names %>% head()
##                  name      start        end      party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
## 1961-01-20    Kennedy 1961-01-20 1963-11-22 Democratic
## 1963-11-22    Johnson 1963-11-22 1969-01-20 Democratic
## 1969-01-20      Nixon 1969-01-20 1974-08-09 Republican
## 1974-08-09       Ford 1974-08-09 1977-01-20 Republican
## 1977-01-20     Carter 1977-01-20 1981-01-20 Democratic

And we will expect the row names to appear as index in python.

py_run_string("PyPresident= r.presidential_names") #renamed for the convenience of future use

py_eval("PyPresident.head()")
##                   name       start         end       party
## 1953-01-20  Eisenhower  1953-01-20  1961-01-20  Republican
## 1961-01-20     Kennedy  1961-01-20  1963-11-22  Democratic
## 1963-11-22     Johnson  1963-11-22  1969-01-20  Democratic
## 1969-01-20       Nixon  1969-01-20  1974-08-09  Republican
## 1974-08-09        Ford  1974-08-09  1977-01-20  Republican

Filtering/slicing differences

One difference is that R doesn’t use row names for slicing whereas in python you can use indexes for slicing.

py_eval("PyPresident.loc[['1953-01-20']]")
##                   name       start         end       party
## 1953-01-20  Eisenhower  1953-01-20  1961-01-20  Republican

Another difference is that when you filter rows in R the row name is dropped by default as the output is printed as a tibble. Remember, tibbles omit row names.

presidential_names %>% filter(start=="1953-01-20") 
##         name      start        end      party
## 1 Eisenhower 1953-01-20 1961-01-20 Republican

However, the index is persevered when rows are filtered in python

# Filtering using == 
py_eval("PyPresident[PyPresident.index=='1953-01-20']")
##                   name       start         end       party
## 1953-01-20  Eisenhower  1953-01-20  1961-01-20  Republican
# Filtering using .loc()
py_eval("PyPresident.loc[['1953-01-20']]")
##                   name       start         end       party
## 1953-01-20  Eisenhower  1953-01-20  1961-01-20  Republican

Other differences

R only supports strings for row names.

presidential_names %>% row.names() %>% class()
## [1] "character"

Even if you explicitly try to coerce characters into another type, it doesn’t work. Let us try to convert the row names back to a date format. It did not work, the row names are still characters.

presidential_Date<-presidential_names

row.names(presidential_Date)<-as.Date(presidential_names %>% row.names(), "%Y-%m-%d") 

presidential_Date %>% row.names() %>% class()
## [1] "character"

When this R data frame is converted into a python data frame, the indexes inherit this character format (object is python’s equivalent of R’s character format).

py_eval("PyPresident.head().index")
## Index(['1953-01-20', '1961-01-20', '1963-11-22', '1969-01-20', '1974-08-09'], dtype='object')

The disadvantage of having the indexes in strings is that you neither can use partial temporal information

py_eval("PyPresident.loc['1953']")
# prints-> KeyError: '1953'

nor other permutation of temporal information for slicing.

py_eval("PyPresident.loc['Jan 1953']")
# prints -> KeyError: 'Jan 1953'

You need to use the complete string to filter out the row.

py_eval("PyPresident.loc[['1953-01-20']]")
##                   name       start         end       party
## 1953-01-20  Eisenhower  1953-01-20  1961-01-20  Republican

Changing the index from strings to date-time format will allow you to achieve them.

py_run_string("PyPresident_Date= PyPresident.copy()")
py_run_string("PyPresident_Date.index = pd.to_datetime(PyPresident_Date.index)")

py_eval("PyPresident_Date.head().index")
## DatetimeIndex(['1953-01-20', '1961-01-20', '1963-11-22', '1969-01-20',
##                '1974-08-09'],
##               dtype='datetime64[ns]', freq=None)

Filtering for presidents using only year. E.g. which president started his term in year 1953?

py_eval("PyPresident_Date.loc['1953']")
##                   name       start         end       party
## 1953-01-20  Eisenhower  1953-01-20  1961-01-20  Republican

Filtering for presidents using year and shorthand for month. E.g. which president started his term in January 1953.

py_eval("PyPresident_Date.loc['Jan 1953']")
##                   name       start         end       party
## 1953-01-20  Eisenhower  1953-01-20  1961-01-20  Republican


Differences between R and python data frames (sub setting)

For this section, we will use msleep dataset from ggplot. This dataset is about the sleep patterns and other biological information of different mammals.

msleep %>% head() %>% knitr::kable()
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
Cheetah Acinonyx carni Carnivora lc 12.1 NA NA 11.9 NA 50.000
Owl monkey Aotus omni Primates NA 17.0 1.8 NA 7.0 0.01550 0.480
Mountain beaver Aplodontia herbi Rodentia nt 14.4 2.4 NA 9.6 NA 1.350
Greater short-tailed shrew Blarina omni Soricomorpha lc 14.9 2.3 0.1333333 9.1 0.00029 0.019
Cow Bos herbi Artiodactyla domesticated 4.0 0.7 0.6666667 20.0 0.42300 600.000
Three-toed sloth Bradypus herbi Pilosa NA 14.4 2.2 0.7666667 9.6 NA 3.850

Similar to sub setting lists, there are 2 approaches to subset data frames.

  1. Preserved sub setting
  2. Simplified sub setting

Preserved sub setting

In preserved sub setting, the data structure of the input is the same as the output.

Preserved sub setting in R

In R, the techniques for preserved sub setting are applicable to selecting both single and multiple variables.

  1. base R via singular square brackets [ ].
msleep["name"] %>% class()
## [1] "tbl_df"     "tbl"        "data.frame"
msleep[, c("name", "sleep_total")] %>% class()
## [1] "tbl_df"     "tbl"        "data.frame"
  1. tidyverse verb, dplyr::select.
msleep %>% select(name) %>% class()
## [1] "tbl_df"     "tbl"        "data.frame"
msleep %>% select(name, sleep_total) %>% class()
## [1] "tbl_df"     "tbl"        "data.frame"

Preserved sub setting in python

In python, the number of techniques for preserved sub setting depends if a single variable or multiple variables is selected.

  1. square brackets [[ ]]

It can be used to select both single and multiple variables. It is the only technique to do preserved sub setting for a single variable.

py_eval("type(r.msleep[['name']])")
## <class 'pandas.core.frame.DataFrame'>
py_eval("type(r.msleep[['name', 'sleep_total']])")  
## <class 'pandas.core.frame.DataFrame'>

For both kinds of sub setting, the number of square brackets used in python is the opposite to what is used in R. In this case, R uses singular square brackets for preserved sub setting and python uses DUAL square brackets for the same kind of sub setting.

  1. .loc attribute

It can only be used for selecting multiple variables if preserved sub setting is desired.

py_eval("type(r.msleep.loc[:, ('name', 'sleep_total')])") 
## <class 'pandas.core.frame.DataFrame'>

Simplified sub setting

For simplified sub setting, “the simplest possible data structure that can represent the output” is returned. Simplest data structures are one-dimensional structures. Thus, it is impossible to convert and store values from multiple variables in these one-dimensional structures. In other words, you can only do simplified sub setting to select a single variable from a data frame.

Simplified sub setting in R

For R there are three techniques to do simplified sub setting.

  1. base R via dual square brackets [[ ]]
msleep[["name"]] %>% class()
## [1] "character"
  1. tidyverse verb, dplyr::pull
msleep %>% pull(name) %>% class()
## [1] "character"
  1. dollar sign syntax $
msleep$name %>% class()
## [1] "character"

Simplified sub setting in python

When you sub set a python data frame in a simplified manner, you will get a series and not a python list nor python array. Series is a one-dimensional array with named axis labels and is governed by the pandas package. There are 3 techniques to do simplified sub setting for python data frames.

  1. sqaure brackets [ ]

Similar to preserved sub setting, the number of brackets for simplified sub setting in python is different than R. Singular square brackets [ ] are used for simplified sub setting in python compared to double square brackets in R.

py_eval("type(r.msleep['name'])")
## <class 'pandas.core.series.Series'>
  1. .loc attirbute
py_eval("type(r.msleep.loc[:,('name')])") 
## <class 'pandas.core.series.Series'>
  1. . techinque
py_eval("type(r.msleep.name)")
## <class 'pandas.core.series.Series'>


Side note about working with R data frames

Occasionally, you may need to convert multi-element named vectors into data frames. For example when working with quantiles.

(Rvec_name<-quantile(rnorm(100)))
##         0%        25%        50%        75%       100% 
## -2.2622171 -0.6303713  0.1307985  0.7885520  2.5857533

We can present data frames either in a long format or in a wide format. In a long data frame, the number of rows is equal to the length of the vector thus; the values of the vector form a column. The names in the vector appear as row names in the data frame.

data.frame(percentiles=names(Rvec_name), values=(Rvec_name))
##      percentiles     values
## 0%            0% -2.2622171
## 25%          25% -0.6303713
## 50%          50%  0.1307985
## 75%          75%  0.7885520
## 100%        100%  2.5857533

In a wide data frame, the number of columns (not rows) is equal to the length of the vector hence the values of the vector will occupy a column of their own. A neat trick to directly achieve a wide data frame from named vectors without any need of transposition is to use tibble with !!!.

tibble(!!!Rvec_name)
## # A tibble: 1 x 5
##    `0%`  `25%` `50%` `75%` `100%`
##   <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1 -2.26 -0.630 0.131 0.789   2.59