Split column into intervals based on row content
up vote
2
down vote
favorite
I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.
I've been playing around with this a bit but haven't found anything at all to come close, as I can't use regex to create a separate column (using str_which, for example) since the information differs so much item to item. How can I use regex to create intervals that I can then split the column into (so I need the information between each row containing "item" in a separate column). Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
r
add a comment |
up vote
2
down vote
favorite
I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.
I've been playing around with this a bit but haven't found anything at all to come close, as I can't use regex to create a separate column (using str_which, for example) since the information differs so much item to item. How can I use regex to create intervals that I can then split the column into (so I need the information between each row containing "item" in a separate column). Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
r
Make a list of data frames (each one is an item and its associated data) and then append them together and assignNA
to empty columns.
– Masoud
Nov 8 at 19:18
@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21
add a comment |
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.
I've been playing around with this a bit but haven't found anything at all to come close, as I can't use regex to create a separate column (using str_which, for example) since the information differs so much item to item. How can I use regex to create intervals that I can then split the column into (so I need the information between each row containing "item" in a separate column). Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
r
I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.
I've been playing around with this a bit but haven't found anything at all to come close, as I can't use regex to create a separate column (using str_which, for example) since the information differs so much item to item. How can I use regex to create intervals that I can then split the column into (so I need the information between each row containing "item" in a separate column). Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
r
r
asked Nov 8 at 19:15
I.I.
111
111
Make a list of data frames (each one is an item and its associated data) and then append them together and assignNA
to empty columns.
– Masoud
Nov 8 at 19:18
@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21
add a comment |
Make a list of data frames (each one is an item and its associated data) and then append them together and assignNA
to empty columns.
– Masoud
Nov 8 at 19:18
@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21
Make a list of data frames (each one is an item and its associated data) and then append them together and assign
NA
to empty columns.– Masoud
Nov 8 at 19:18
Make a list of data frames (each one is an item and its associated data) and then append them together and assign
NA
to empty columns.– Masoud
Nov 8 at 19:18
@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21
@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
Here is a function to reformat your data depending on how you want the final dataset to look like. For the function, you supply the dataframe DF
, the variable var
, and a vector of column names in the correct order colnames
and byitem
to choose the output format (default is TRUE
, which outputs a dataframe with one row per item
):
library(tidyverse)
df_transform = function(DF, var, colnames, byitem = TRUE)
if(byitem)
ID = sym("rowid")
else
ID = sym("id")
DF %>%
group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
filter(!grepl("item", var)) %>%
spread(!!ID, var)
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59
@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03
@I.I. If you are willing to impose an assumption that "$" is alwaysprice
, characters are alwayscolor
, and dates are alwaysdate
, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05
thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36
@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
Here is a function to reformat your data depending on how you want the final dataset to look like. For the function, you supply the dataframe DF
, the variable var
, and a vector of column names in the correct order colnames
and byitem
to choose the output format (default is TRUE
, which outputs a dataframe with one row per item
):
library(tidyverse)
df_transform = function(DF, var, colnames, byitem = TRUE)
if(byitem)
ID = sym("rowid")
else
ID = sym("id")
DF %>%
group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
filter(!grepl("item", var)) %>%
spread(!!ID, var)
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59
@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03
@I.I. If you are willing to impose an assumption that "$" is alwaysprice
, characters are alwayscolor
, and dates are alwaysdate
, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05
thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36
@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02
add a comment |
up vote
0
down vote
Here is a function to reformat your data depending on how you want the final dataset to look like. For the function, you supply the dataframe DF
, the variable var
, and a vector of column names in the correct order colnames
and byitem
to choose the output format (default is TRUE
, which outputs a dataframe with one row per item
):
library(tidyverse)
df_transform = function(DF, var, colnames, byitem = TRUE)
if(byitem)
ID = sym("rowid")
else
ID = sym("id")
DF %>%
group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
filter(!grepl("item", var)) %>%
spread(!!ID, var)
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59
@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03
@I.I. If you are willing to impose an assumption that "$" is alwaysprice
, characters are alwayscolor
, and dates are alwaysdate
, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05
thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36
@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02
add a comment |
up vote
0
down vote
up vote
0
down vote
Here is a function to reformat your data depending on how you want the final dataset to look like. For the function, you supply the dataframe DF
, the variable var
, and a vector of column names in the correct order colnames
and byitem
to choose the output format (default is TRUE
, which outputs a dataframe with one row per item
):
library(tidyverse)
df_transform = function(DF, var, colnames, byitem = TRUE)
if(byitem)
ID = sym("rowid")
else
ID = sym("id")
DF %>%
group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
filter(!grepl("item", var)) %>%
spread(!!ID, var)
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
Here is a function to reformat your data depending on how you want the final dataset to look like. For the function, you supply the dataframe DF
, the variable var
, and a vector of column names in the correct order colnames
and byitem
to choose the output format (default is TRUE
, which outputs a dataframe with one row per item
):
library(tidyverse)
df_transform = function(DF, var, colnames, byitem = TRUE)
if(byitem)
ID = sym("rowid")
else
ID = sym("id")
DF %>%
group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
filter(!grepl("item", var)) %>%
spread(!!ID, var)
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
edited Nov 8 at 19:57
answered Nov 8 at 19:36
avid_useR
11.5k41830
11.5k41830
Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59
@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03
@I.I. If you are willing to impose an assumption that "$" is alwaysprice
, characters are alwayscolor
, and dates are alwaysdate
, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05
thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36
@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02
add a comment |
Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59
@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03
@I.I. If you are willing to impose an assumption that "$" is alwaysprice
, characters are alwayscolor
, and dates are alwaysdate
, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05
thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36
@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02
Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59
Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59
@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03
@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03
@I.I. If you are willing to impose an assumption that "$" is always
price
, characters are always color
, and dates are always date
, then I can modify my answer to accommodate that.– avid_useR
Nov 8 at 20:05
@I.I. If you are willing to impose an assumption that "$" is always
price
, characters are always color
, and dates are always date
, then I can modify my answer to accommodate that.– avid_useR
Nov 8 at 20:05
thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36
thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36
@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02
@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02
add a comment |
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53214673%2fsplit-column-into-intervals-based-on-row-content%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Make a list of data frames (each one is an item and its associated data) and then append them together and assign
NA
to empty columns.– Masoud
Nov 8 at 19:18
@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21