Split column into intervals based on row content

up vote
2
down vote

favorite

I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.

I've been playing around with this a bit but haven't found anything at all to come close, as I can't use regex to create a separate column (using str_which, for example) since the information differs so much item to item. How can I use regex to create intervals that I can then split the column into (so I need the information between each row containing "item" in a separate column). Sample data is below.

data

item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017

Thank you!

asked Nov 8 at 19:15

I.I.

111

Make a list of data frames (each one is an item and its associated data) and then append them together and assign NA to empty columns.
– Masoud
Nov 8 at 19:18

@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21

add a comment |

up vote
2
down vote

favorite

data

item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017

Thank you!

asked Nov 8 at 19:15

I.I.

111

Make a list of data frames (each one is an item and its associated data) and then append them together and assign NA to empty columns.
– Masoud
Nov 8 at 19:18

@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21

add a comment |

up vote
2
down vote

favorite

data

item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017

Thank you!

asked Nov 8 at 19:15

I.I.

111

data

item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017

Thank you!

asked Nov 8 at 19:15

I.I.

111

asked Nov 8 at 19:15

I.I.

111

asked Nov 8 at 19:15

I.I.

111

asked Nov 8 at 19:15

I.I.

111

asked Nov 8 at 19:15

I.I.

111

Make a list of data frames (each one is an item and its associated data) and then append them together and assign NA to empty columns.
– Masoud
Nov 8 at 19:18

@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21

add a comment |

Make a list of data frames (each one is an item and its associated data) and then append them together and assign NA to empty columns.
– Masoud
Nov 8 at 19:18

@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21

Make a list of data frames (each one is an item and its associated data) and then append them together and assign NA to empty columns.
– Masoud
Nov 8 at 19:18

@Masoud that's kind of what I'm trying to figure out how to do. I'm trying to split this data (I received it like this) by item, but since each item has a different number of associated rows, with different content, I'm not sure how
– I.I.
Nov 8 at 19:21

add a comment |

1 Answer
1

active

oldest

votes

up vote
0
down vote

Here is a function to reformat your data depending on how you want the final dataset to look like. For the function, you supply the dataframe DF, the variable var, and a vector of column names in the correct order colnames and byitem to choose the output format (default is TRUE, which outputs a dataframe with one row per item):

library(tidyverse)

df_transform = function(DF, var, colnames, byitem = TRUE)
 if(byitem)
 ID = sym("rowid")
 else
 ID = sym("id")
 
 DF %>%
 group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
 mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
 filter(!grepl("item", var)) %>%
 spread(!!ID, var)

Output:

> df_transform(df, var, c("price", "color", "date"))

# A tibble: 3 x 4
# Groups: id [3]
 id color date price
 <chr> <fct> <fct> <fct>
1 item1 red <NA> $600 
2 item2 <NA> <NA> $70 
3 item3 orange 10/11/2017 $430 


> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)

# A tibble: 3 x 4
 rowid item1 item2 item3 
 <chr> <fct> <fct> <fct> 
1 color red <NA> orange 
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430

Note that this would not work if you have missing values in the middle, since the column names are assigned by position.

Data:

df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L, 
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1", 
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA, 
-9L))

edited Nov 8 at 19:57

answered Nov 8 at 19:36

avid_useR

11.5k41830

Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59

@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03

@I.I. If you are willing to impose an assumption that "$" is always price, characters are always color, and dates are always date, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05

thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36

@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53214673%2fsplit-column-into-intervals-based-on-row-content%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

library(tidyverse)

df_transform = function(DF, var, colnames, byitem = TRUE)
 if(byitem)
 ID = sym("rowid")
 else
 ID = sym("id")
 
 DF %>%
 group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
 mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
 filter(!grepl("item", var)) %>%
 spread(!!ID, var)

Output:

> df_transform(df, var, c("price", "color", "date"))

# A tibble: 3 x 4
# Groups: id [3]
 id color date price
 <chr> <fct> <fct> <fct>
1 item1 red <NA> $600 
2 item2 <NA> <NA> $70 
3 item3 orange 10/11/2017 $430 


> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)

# A tibble: 3 x 4
 rowid item1 item2 item3 
 <chr> <fct> <fct> <fct> 
1 color red <NA> orange 
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430

Note that this would not work if you have missing values in the middle, since the column names are assigned by position.

Data:

df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L, 
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1", 
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA, 
-9L))

edited Nov 8 at 19:57

answered Nov 8 at 19:36

avid_useR

11.5k41830

Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59

@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03

@I.I. If you are willing to impose an assumption that "$" is always price, characters are always color, and dates are always date, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05

thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36

@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02

add a comment |

up vote
0
down vote

library(tidyverse)

df_transform = function(DF, var, colnames, byitem = TRUE)
 if(byitem)
 ID = sym("rowid")
 else
 ID = sym("id")
 
 DF %>%
 group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
 mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
 filter(!grepl("item", var)) %>%
 spread(!!ID, var)

Output:

> df_transform(df, var, c("price", "color", "date"))

# A tibble: 3 x 4
# Groups: id [3]
 id color date price
 <chr> <fct> <fct> <fct>
1 item1 red <NA> $600 
2 item2 <NA> <NA> $70 
3 item3 orange 10/11/2017 $430 


> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)

# A tibble: 3 x 4
 rowid item1 item2 item3 
 <chr> <fct> <fct> <fct> 
1 color red <NA> orange 
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430

Note that this would not work if you have missing values in the middle, since the column names are assigned by position.

Data:

df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L, 
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1", 
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA, 
-9L))

edited Nov 8 at 19:57

answered Nov 8 at 19:36

avid_useR

11.5k41830

Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59

@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03

@I.I. If you are willing to impose an assumption that "$" is always price, characters are always color, and dates are always date, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05

thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36

@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02

add a comment |

up vote
0
down vote

library(tidyverse)

df_transform = function(DF, var, colnames, byitem = TRUE)
 if(byitem)
 ID = sym("rowid")
 else
 ID = sym("id")
 
 DF %>%
 group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
 mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
 filter(!grepl("item", var)) %>%
 spread(!!ID, var)

Output:

> df_transform(df, var, c("price", "color", "date"))

# A tibble: 3 x 4
# Groups: id [3]
 id color date price
 <chr> <fct> <fct> <fct>
1 item1 red <NA> $600 
2 item2 <NA> <NA> $70 
3 item3 orange 10/11/2017 $430 


> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)

# A tibble: 3 x 4
 rowid item1 item2 item3 
 <chr> <fct> <fct> <fct> 
1 color red <NA> orange 
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430

Note that this would not work if you have missing values in the middle, since the column names are assigned by position.

Data:

df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L, 
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1", 
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA, 
-9L))

edited Nov 8 at 19:57

answered Nov 8 at 19:36

avid_useR

11.5k41830

library(tidyverse)

df_transform = function(DF, var, colnames, byitem = TRUE)
 if(byitem)
 ID = sym("rowid")
 else
 ID = sym("id")
 
 DF %>%
 group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
 mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
 filter(!grepl("item", var)) %>%
 spread(!!ID, var)

Output:

> df_transform(df, var, c("price", "color", "date"))

# A tibble: 3 x 4
# Groups: id [3]
 id color date price
 <chr> <fct> <fct> <fct>
1 item1 red <NA> $600 
2 item2 <NA> <NA> $70 
3 item3 orange 10/11/2017 $430 


> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)

# A tibble: 3 x 4
 rowid item1 item2 item3 
 <chr> <fct> <fct> <fct> 
1 color red <NA> orange 
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430

Note that this would not work if you have missing values in the middle, since the column names are assigned by position.

Data:

df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L, 
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1", 
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA, 
-9L))

edited Nov 8 at 19:57

answered Nov 8 at 19:36

avid_useR

11.5k41830

edited Nov 8 at 19:57

answered Nov 8 at 19:36

avid_useR

11.5k41830

answered Nov 8 at 19:36

avid_useR

11.5k41830

answered Nov 8 at 19:36

avid_useR

11.5k41830

Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59

@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03

@I.I. If you are willing to impose an assumption that "$" is always price, characters are always color, and dates are always date, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05

thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36

@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02

add a comment |

Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59

@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03

@I.I. If you are willing to impose an assumption that "$" is always price, characters are always color, and dates are always date, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05

thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36

@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02

Thank you so much! This is great. I was wondering, though (sorry for not making it clear in my original question, and for forgetting to provide reproducible data), if there's something that can be done if the "price" and "color" columns, for example, are not always in the same order (as in, not always in row 2). Thank you again!
– I.I.
Nov 8 at 19:59

@I.I. Well, other than the position of your values, there is no way of knowing whether the literal "$100" is supposed to be a price or something else. Unless there is another column to indicate the column name for each row, or to naively assume that a value with a dollar sign is always "price", I'm afraid your data is too ambiguous to do this kind of transformation.
– avid_useR
Nov 8 at 20:03

@I.I. If you are willing to impose an assumption that "$" is always price, characters are always color, and dates are always date, then I can modify my answer to accommodate that.
– avid_useR
Nov 8 at 20:05

thank you! That makes sense... I think I'm going to go ahead and try this code but try a few different regex expressions to get the columns out...I do have just one more question, though - there are a lot of duplicated rows in the set of kind of useless information (like, table 1 will have two rows of orange, just as an example) and dropping duplicate rows removes all instances of orange (which I don't want). I was wondering if there was a way within using mutate + case_when to drop rows when it doesn't match. I hope that makes sense - thank you again!
– I.I.
Nov 8 at 22:36

@I.I. Do you mean only keeping the first row for each duplicate rows? Perhaps you should update your question and provide expected output to illustrate what you mean.
– avid_useR
Nov 9 at 3:02

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Dfyjkt