What `R` you? (R dataframe vs python dataframe)
Recap
Previously in this series, we discovered the equivalent python
data structures for the following R
data structures:
In this post, we will look at translating R
data frames into python
. We will also compare and contrast data frames in R
and python
.
R
data frame is a python
…
Pretty straight forward, a R
data frame is a python
data frame.
We will use an in built data frame, OrchardSprays, for our illustration. This dataset is from a study, which examined the relationship on the uptake of sugar solutions by bees when different concentrations of pesticide were added to the syrup.
OrchardSprays has four variables:
rowpos: The row position of the solution
colpos: The column position of the solution
treatment: The type of orchard spray added to the solution
decrease: The decrease in volume of the solution at the end of the study which indicates the amount of syrup consumed by the bees.
library(tidyverse)
library(reticulate)
py_run_string("import pandas as pd")
OrchardSprays %>% select(rowpos, colpos, treatment, decrease) %>% head()
## rowpos colpos treatment decrease
## 1 1 1 D 57
## 2 2 1 E 95
## 3 3 1 B 8
## 4 4 1 H 69
## 5 5 1 G 92
## 6 6 1 F 90
Let’s check the data structure of OrchardSprays in R
.
class(OrchardSprays)
## [1] "data.frame"
And now let is check the data structure when it is translated to python
.
r_to_py(OrchardSprays) %>% class()
## [1] "pandas.core.frame.DataFrame"
## [2] "pandas.core.generic.NDFrame"
## [3] "pandas.core.base.PandasObject"
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"
## [6] "python.builtin.object"
Data frames in python
are governed by the pandas
package.
List columns
Occasionally, columns in a data frame contain lists instead of atomic vectors. These data frames are also known as nested data frames. Let us see how python
handles a nested data frame from R
. First, we will create a nested data frame. Variables “rowpos”, “colpos”, “decrease” will be nested inside the type of treatment. The nested variables appear as a list column called “data”.
Orchard_nest<- OrchardSprays %>% nest(-treatment)
knitr::kable(Orchard_nest)
treatment | data |
---|---|
D | list(decrease = c(57, 36, 22, 51, 28, 27, 20, 39), rowpos = c(1, 4, 5, 8, 2, 6, 7, 3), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
E | list(decrease = c(95, 51, 39, 114, 43, 47, 61, 55), rowpos = c(2, 5, 4, 3, 1, 7, 8, 6), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
B | list(decrease = c(8, 6, 4, 10, 7, 4, 8, 14), rowpos = c(3, 2, 8, 7, 6, 5, 1, 4), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
H | list(decrease = c(69, 127, 72, 130, 81, 76, 81, 86), rowpos = c(4, 3, 2, 1, 7, 8, 6, 5), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
G | list(decrease = c(92, 71, 72, 24, 60, 77, 72, 80), rowpos = c(5, 8, 7, 6, 3, 4, 2, 1), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
F | list(decrease = c(90, 69, 87, 20, 71, 44, 57, 114), rowpos = c(6, 7, 1, 5, 8, 3, 4, 2), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
C | list(decrease = c(15, 84, 16, 9, 17, 29, 13, 19), rowpos = c(7, 1, 6, 4, 5, 2, 3, 8), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
A | list(decrease = c(2, 2, 5, 4, 5, 12, 4, 3), rowpos = c(8, 6, 3, 2, 4, 1, 5, 7), colpos = c(1, 2, 3, 4, 5, 6, 7, 8)) |
A nested data frame is still a R
data frame.
class(Orchard_nest)
## [1] "data.frame"
Hence, it should be treated as a python
data frame when translated.
r_to_py(Orchard_nest) %>% class()
## [1] "pandas.core.frame.DataFrame"
## [2] "pandas.core.generic.NDFrame"
## [3] "pandas.core.base.PandasObject"
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"
## [6] "python.builtin.object"
The dimension of a nested data frame in R
and the translated data frame is the same.
dim(Orchard_nest)
## [1] 8 2
r_to_py(Orchard_nest) %>% dim()
## [1] 8 2
Inspecting the internal structure of the nested data frame, we can see for the column “data” there are nested variables within it .
str(Orchard_nest)
## 'data.frame': 8 obs. of 2 variables:
## $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1
## $ data :List of 8
## ..$ :'data.frame': 8 obs. of 3 variables:
## .. ..$ decrease: num 57 36 22 51 28 27 20 39
## .. ..$ rowpos : num 1 4 5 8 2 6 7 3
## .. ..$ colpos : num 1 2 3 4 5 6 7 8
## ..$ :'data.frame': 8 obs. of 3 variables:
## .. ..$ decrease: num 95 51 39 114 43 47 61 55
## .. ..$ rowpos : num 2 5 4 3 1 7 8 6
## .. ..$ colpos : num 1 2 3 4 5 6 7 8
## ..$ :'data.frame': 8 obs. of 3 variables:
## .. ..$ decrease: num 8 6 4 10 7 4 8 14
## .. ..$ rowpos : num 3 2 8 7 6 5 1 4
## .. ..$ colpos : num 1 2 3 4 5 6 7 8
## ..$ :'data.frame': 8 obs. of 3 variables:
## .. ..$ decrease: num 69 127 72 130 81 76 81 86
## .. ..$ rowpos : num 4 3 2 1 7 8 6 5
## .. ..$ colpos : num 1 2 3 4 5 6 7 8
## ..$ :'data.frame': 8 obs. of 3 variables:
## .. ..$ decrease: num 92 71 72 24 60 77 72 80
## .. ..$ rowpos : num 5 8 7 6 3 4 2 1
## .. ..$ colpos : num 1 2 3 4 5 6 7 8
## ..$ :'data.frame': 8 obs. of 3 variables:
## .. ..$ decrease: num 90 69 87 20 71 44 57 114
## .. ..$ rowpos : num 6 7 1 5 8 3 4 2
## .. ..$ colpos : num 1 2 3 4 5 6 7 8
## ..$ :'data.frame': 8 obs. of 3 variables:
## .. ..$ decrease: num 15 84 16 9 17 29 13 19
## .. ..$ rowpos : num 7 1 6 4 5 2 3 8
## .. ..$ colpos : num 1 2 3 4 5 6 7 8
## ..$ :'data.frame': 8 obs. of 3 variables:
## .. ..$ decrease: num 2 2 5 4 5 12 4 3
## .. ..$ rowpos : num 8 6 3 2 4 1 5 7
## .. ..$ colpos : num 1 2 3 4 5 6 7 8
However, when inspecting the structure of the translated nested data frame, the list column appears as an ordinary python
column.
r.Orchard_nest.info(verbose=True) #ran using {python} code chunk. unable to run in {r} code chunk with py_eval nor py_run_string
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 8 entries, 0 to 7
## Data columns (total 2 columns):
## treatment 8 non-null category
## data 8 non-null object
## dtypes: category(1), object(1)
## memory usage: 584.0+ bytes
Also, when you print a nested data frame from R
in python
, the list column appears partially printed with misalignment.
py_eval("r.Orchard_nest")
## treatment data
## 0 D decrease rowpos colpos
## 0 57.0 1....
## 1 E decrease rowpos colpos
## 0 95.0 2....
## 2 B decrease rowpos colpos
## 0 8.0 3....
## 3 H decrease rowpos colpos
## 0 69.0 4....
## 4 G decrease rowpos colpos
## 0 92.0 5....
## 5 F decrease rowpos colpos
## 0 90.0 6....
## 6 C decrease rowpos colpos
## 0 15.0 7....
## 7 A decrease rowpos colpos
## 0 2.0 8....
Tibbles
Tibbles are the modern cousin of traditional R
data frames. Let us see how they are treated in python
. First, we will convert a R
data frame into a tibble. We will use another built in data frame in R
, LifeCycleSavings. This dataset is about the savings ratio of different countries from 1960-1970.
LifeCycleSavings %>% head()
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
The output tbl_df
indicates the structure is a tibble.
LCS_tibble<-as_tibble(LifeCycleSavings)
class(LCS_tibble)
## [1] "tbl_df" "tbl" "data.frame"
One thing to note about tibbles is that they do not have row names thus the row names were dropped when the data frame were converted into a tibble.
LCS_tibble %>% head()
## # A tibble: 6 x 5
## sr pop15 pop75 dpi ddpi
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11.4 29.4 2.87 2330. 2.87
## 2 12.1 23.3 4.41 1508. 3.93
## 3 13.2 23.8 4.43 2108. 3.82
## 4 5.75 41.9 1.67 189. 0.22
## 5 12.9 42.2 0.83 728. 4.56
## 6 8.79 31.7 2.85 2983. 2.43
Now let’s see how python
recognizes a tibble.
r_to_py(LCS_tibble) %>% class()
## [1] "pandas.core.frame.DataFrame"
## [2] "pandas.core.generic.NDFrame"
## [3] "pandas.core.base.PandasObject"
## [4] "pandas.core.accessor.DirNamesMixin"
## [5] "pandas.core.base.SelectionMixin"
## [6] "python.builtin.object"
The tibble is translated into a python
data frame.
Let us check that there are no glitches when python
translates a tibble into a python
dataframe.
py_eval("r.LCS_tibble.head()")
## sr pop15 pop75 dpi ddpi
## 0 11.43 29.35 2.87 2329.68 2.87
## 1 12.07 23.32 4.41 1507.99 3.93
## 2 13.17 23.80 4.43 2108.47 3.82
## 3 5.75 41.89 1.67 189.13 0.22
## 4 12.88 42.19 0.83 728.47 4.56
The translated tibble prints the same as a R
tibble.
Differences between R
and python
data frames (row names and indexes)
Intro
If the data frame in R
has row names it will be maintained in the python
data frame.
LifeCycleSavings %>% head() %>% row.names()
## [1] "Australia" "Austria" "Belgium" "Bolivia" "Brazil" "Canada"
Just that in python
row names are termed as indexes.
py_eval("r.LifeCycleSavings.head().index")
## Index(['Australia', 'Austria', 'Belgium', 'Bolivia', 'Brazil'], dtype='object')
Creating row names and indexes
There are R
data frames without row names such as this pre-installed data frame, presidential. This dataset is about the terms of the eleven USA presidents from Eisenhower to Obama.
presidential %>% head()
## # A tibble: 6 x 4
## name start end party
## <chr> <date> <date> <chr>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy 1961-01-20 1963-11-22 Democratic
## 3 Johnson 1963-11-22 1969-01-20 Democratic
## 4 Nixon 1969-01-20 1974-08-09 Republican
## 5 Ford 1974-08-09 1977-01-20 Republican
## 6 Carter 1977-01-20 1981-01-20 Democratic
Occasionally, you may decide to create row names for your data frame using your own set of values or values from a variable from the data frame. Let us designate the values from the “start” column to be the row names.
presidential_names<-presidential %>% mutate(start_term=start) %>% column_to_rownames(var="start") %>% select(name, start=start_term, everything())
presidential_names %>% head()
## name start end party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
## 1961-01-20 Kennedy 1961-01-20 1963-11-22 Democratic
## 1963-11-22 Johnson 1963-11-22 1969-01-20 Democratic
## 1969-01-20 Nixon 1969-01-20 1974-08-09 Republican
## 1974-08-09 Ford 1974-08-09 1977-01-20 Republican
## 1977-01-20 Carter 1977-01-20 1981-01-20 Democratic
And we will expect the row names to appear as index in python
.
py_run_string("PyPresident= r.presidential_names") #renamed for the convenience of future use
py_eval("PyPresident.head()")
## name start end party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
## 1961-01-20 Kennedy 1961-01-20 1963-11-22 Democratic
## 1963-11-22 Johnson 1963-11-22 1969-01-20 Democratic
## 1969-01-20 Nixon 1969-01-20 1974-08-09 Republican
## 1974-08-09 Ford 1974-08-09 1977-01-20 Republican
Filtering/slicing differences
One difference is that R
doesn’t use row names for slicing whereas in python
you can use indexes for slicing.
py_eval("PyPresident.loc[['1953-01-20']]")
## name start end party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
Another difference is that when you filter rows in R
the row name is dropped by default as the output is printed as a tibble. Remember, tibbles omit row names.
presidential_names %>% filter(start=="1953-01-20")
## name start end party
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
However, the index is persevered when rows are filtered in python
# Filtering using ==
py_eval("PyPresident[PyPresident.index=='1953-01-20']")
## name start end party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
# Filtering using .loc()
py_eval("PyPresident.loc[['1953-01-20']]")
## name start end party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
Other differences
R
only supports strings for row names.
presidential_names %>% row.names() %>% class()
## [1] "character"
Even if you explicitly try to coerce characters into another type, it doesn’t work. Let us try to convert the row names back to a date format. It did not work, the row names are still characters.
presidential_Date<-presidential_names
row.names(presidential_Date)<-as.Date(presidential_names %>% row.names(), "%Y-%m-%d")
presidential_Date %>% row.names() %>% class()
## [1] "character"
When this R
data frame is converted into a python
data frame, the indexes inherit this character format (object
is python
’s equivalent of R
’s character
format).
py_eval("PyPresident.head().index")
## Index(['1953-01-20', '1961-01-20', '1963-11-22', '1969-01-20', '1974-08-09'], dtype='object')
The disadvantage of having the indexes in strings is that you neither can use partial temporal information
py_eval("PyPresident.loc['1953']")
# prints-> KeyError: '1953'
nor other permutation of temporal information for slicing.
py_eval("PyPresident.loc['Jan 1953']")
# prints -> KeyError: 'Jan 1953'
You need to use the complete string to filter out the row.
py_eval("PyPresident.loc[['1953-01-20']]")
## name start end party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
Changing the index from strings to date-time format will allow you to achieve them.
py_run_string("PyPresident_Date= PyPresident.copy()")
py_run_string("PyPresident_Date.index = pd.to_datetime(PyPresident_Date.index)")
py_eval("PyPresident_Date.head().index")
## DatetimeIndex(['1953-01-20', '1961-01-20', '1963-11-22', '1969-01-20',
## '1974-08-09'],
## dtype='datetime64[ns]', freq=None)
Filtering for presidents using only year. E.g. which president started his term in year 1953?
py_eval("PyPresident_Date.loc['1953']")
## name start end party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
Filtering for presidents using year and shorthand for month. E.g. which president started his term in January 1953.
py_eval("PyPresident_Date.loc['Jan 1953']")
## name start end party
## 1953-01-20 Eisenhower 1953-01-20 1961-01-20 Republican
Differences between R
and python
data frames (sub setting)
For this section, we will use msleep dataset from ggplot
. This dataset is about the sleep patterns and other biological information of different mammals.
msleep %>% head() %>% knitr::kable()
name | genus | vore | order | conservation | sleep_total | sleep_rem | sleep_cycle | awake | brainwt | bodywt |
---|---|---|---|---|---|---|---|---|---|---|
Cheetah | Acinonyx | carni | Carnivora | lc | 12.1 | NA | NA | 11.9 | NA | 50.000 |
Owl monkey | Aotus | omni | Primates | NA | 17.0 | 1.8 | NA | 7.0 | 0.01550 | 0.480 |
Mountain beaver | Aplodontia | herbi | Rodentia | nt | 14.4 | 2.4 | NA | 9.6 | NA | 1.350 |
Greater short-tailed shrew | Blarina | omni | Soricomorpha | lc | 14.9 | 2.3 | 0.1333333 | 9.1 | 0.00029 | 0.019 |
Cow | Bos | herbi | Artiodactyla | domesticated | 4.0 | 0.7 | 0.6666667 | 20.0 | 0.42300 | 600.000 |
Three-toed sloth | Bradypus | herbi | Pilosa | NA | 14.4 | 2.2 | 0.7666667 | 9.6 | NA | 3.850 |
Similar to sub setting lists, there are 2 approaches to subset data frames.
- Preserved sub setting
- Simplified sub setting
Preserved sub setting
In preserved sub setting, the data structure of the input is the same as the output.
Preserved sub setting in R
In R
, the techniques for preserved sub setting are applicable to selecting both single and multiple variables.
- base
R
via singular square brackets[ ]
.
msleep["name"] %>% class()
## [1] "tbl_df" "tbl" "data.frame"
msleep[, c("name", "sleep_total")] %>% class()
## [1] "tbl_df" "tbl" "data.frame"
- tidyverse verb,
dplyr::select
.
msleep %>% select(name) %>% class()
## [1] "tbl_df" "tbl" "data.frame"
msleep %>% select(name, sleep_total) %>% class()
## [1] "tbl_df" "tbl" "data.frame"
Preserved sub setting in python
In python
, the number of techniques for preserved sub setting depends if a single variable or multiple variables is selected.
- square brackets
[[ ]]
It can be used to select both single and multiple variables. It is the only technique to do preserved sub setting for a single variable.
py_eval("type(r.msleep[['name']])")
## <class 'pandas.core.frame.DataFrame'>
py_eval("type(r.msleep[['name', 'sleep_total']])")
## <class 'pandas.core.frame.DataFrame'>
For both kinds of sub setting, the number of square brackets used in python
is the opposite to what is used in R
. In this case, R
uses singular square brackets for preserved sub setting and python
uses DUAL square brackets for the same kind of sub setting.
.loc
attribute
It can only be used for selecting multiple variables if preserved sub setting is desired.
py_eval("type(r.msleep.loc[:, ('name', 'sleep_total')])")
## <class 'pandas.core.frame.DataFrame'>
Simplified sub setting
For simplified sub setting, “the simplest possible data structure that can represent the output” is returned. Simplest data structures are one-dimensional structures. Thus, it is impossible to convert and store values from multiple variables in these one-dimensional structures. In other words, you can only do simplified sub setting to select a single variable from a data frame.
Simplified sub setting in R
For R
there are three techniques to do simplified sub setting.
- base
R
via dual square brackets[[ ]]
msleep[["name"]] %>% class()
## [1] "character"
- tidyverse verb,
dplyr::pull
msleep %>% pull(name) %>% class()
## [1] "character"
- dollar sign syntax
$
msleep$name %>% class()
## [1] "character"
Simplified sub setting in python
When you sub set a python
data frame in a simplified manner, you will get a series and not a python
list nor python
array. Series is a one-dimensional array with named axis labels and is governed by the pandas
package. There are 3 techniques to do simplified sub setting for python
data frames.
- sqaure brackets
[ ]
Similar to preserved sub setting, the number of brackets for simplified sub setting in python
is different than R
. Singular square brackets [ ]
are used for simplified sub setting in python
compared to double square brackets in R
.
py_eval("type(r.msleep['name'])")
## <class 'pandas.core.series.Series'>
.loc
attirbute
py_eval("type(r.msleep.loc[:,('name')])")
## <class 'pandas.core.series.Series'>
.
techinque
py_eval("type(r.msleep.name)")
## <class 'pandas.core.series.Series'>
Side note about working with R
data frames
Occasionally, you may need to convert multi-element named vectors into data frames. For example when working with quantiles.
(Rvec_name<-quantile(rnorm(100)))
## 0% 25% 50% 75% 100%
## -2.2622171 -0.6303713 0.1307985 0.7885520 2.5857533
We can present data frames either in a long format or in a wide format. In a long data frame, the number of rows is equal to the length of the vector thus; the values of the vector form a column. The names in the vector appear as row names in the data frame.
data.frame(percentiles=names(Rvec_name), values=(Rvec_name))
## percentiles values
## 0% 0% -2.2622171
## 25% 25% -0.6303713
## 50% 50% 0.1307985
## 75% 75% 0.7885520
## 100% 100% 2.5857533
In a wide data frame, the number of columns (not rows) is equal to the length of the vector hence the values of the vector will occupy a column of their own. A neat trick to directly achieve a wide data frame from named vectors without any need of transposition is to use tibble
with !!!
.
tibble(!!!Rvec_name)
## # A tibble: 1 x 5
## `0%` `25%` `50%` `75%` `100%`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -2.26 -0.630 0.131 0.789 2.59