Parsing SVG paths in R
I'm trying to crack an R workflow for parsing SVG paths, using this file on this webpage. I'm encountering artifacts in the positioning of resulting polygons:
Some of the countries do not align with their neighbours, e.g. US/Canada, US/Mexico, Russia/Asian neighbours. Since the effect hits the countries with more complex polygons, it seems likely to be a problem with cumulative summing, but I'm unclear where the problem lies in my workflow.
I reproduce the full workflow here using R (for US/Canada), with an external call to nodejs:
require(dplyr)
require(purrr)
require(stringr)
require(tidyr)
require(ggplot2)
require(rvest)
require(xml2)
require(jsonlite)
# Get and parse the SVG
doc = read_xml('https://visionscarto.net/public/fonds-de-cartes-en/visionscarto-bertin1953.svg')
countries = doc %>% html_nodes('.country')
names(countries) = html_attr(countries, 'id')
cdi = str_which(names(countries), 'CIV') # unicode in Cote d'Ivoire breaks the code
countries = countries[-cdi]
# Extract SVG paths and parse with node's svg-path-parser module.
# If you don't have node you can use this instead (note this step might be the problem):
# d = read_csv('https://gist.githubusercontent.com/geotheory/b7353a7a8a480209b31418c806cb1c9e/raw/6d3ba2a62f6e8667eef15e29a5893d9d795e8bb1/bertin_svg.csv')
d = imap_dfr(countries, ~{
  message(.y)
  svg_path = xml_find_all(.x, paste0("//*[@id='", .y, "']/d1:path")) %>% html_attr('d')
  node_call = paste0("node -e \"var parseSVG = require('svg-path-parser'); var d='", svg_path,
                     "'; console.log(JSON.stringify(parseSVG(d)));\"")
  system(node_call, intern = T) %>% fromJSON %>% mutate(country = .y)
}) %>% as_data_frame()
# some initial processing
d1 = d %>% filter(country %in% c('USA United States','CAN Canada')) %>%
mutate(x = replace_na(x, 0), y = replace_na(y, 0), # NAs need replacing
relative = replace_na(relative, FALSE),
grp = (command == 'closepath') %>% cumsum) # polygon grouping variable
# new object to loop through
d2 = d1 %>% mutate(x_adj = x, y_adj = y) %>% filter(command != 'closepath')
# loop through and change relative coords to absolute
for(i in 2:nrow(d2)){
  if(d2$relative[i]){ # cumulative sum where coords are relative
    d2$x_adj[i] = d2$x_adj[i-1] + d2$x_adj[i]
    d2$y_adj[i] = d2$y_adj[i-1] + d2$y_adj[i]
  } else{ # code M/L require no alteration
    if(d2$code[i] == 'V') d2$x_adj[i] = d2$x_adj[i-1] # absolute vertical transform inherits previous x
    if(d2$code[i] == 'H') d2$y_adj[i] = d2$y_adj[i-1] # absolute horizontal transform inherits previous y
  }
}
# plot result
d2 %>% ggplot(aes(x_adj, -y_adj, group = paste(country, grp))) +
geom_polygon(fill='white', col='black', size=.3) +
coord_equal() + guides(fill=F)
Any assistance appreciated. The SVG path syntax is specified at w3 and summarised more concisely here.
Edit (response to @ccprog)
Here is the data returned from svg-path-parser for the H command sequence:
code command x y relative country
<chr> <chr> <dbl> <dbl> <lgl> <chr>
1 l lineto -0.91 -0.6 TRUE CAN Canada
2 l lineto -0.92 -0.59 TRUE CAN Canada
3 H horizontal lineto 189. NA NA CAN Canada
4 l lineto -1.03 0.02 TRUE CAN Canada
5 l lineto -0.74 -0.07 TRUE CAN Canada
Here is what d2 looks like for the same sequence after the loop:
code command x y relative country grp x_adj y_adj
<chr> <chr> <dbl> <dbl> <lgl> <chr> <int> <dbl> <dbl>
1 l lineto -0.91 -0.6 TRUE CAN Canada 20 199. 143.
2 l lineto -0.92 -0.59 TRUE CAN Canada 20 198. 143.
3 H horizontal lineto 189. 0 FALSE CAN Canada 20 189. 143.
4 l lineto -1.03 0.02 TRUE CAN Canada 20 188. 143.
5 l lineto -0.74 -0.07 TRUE CAN Canada 20 187. 143.
Does this not look OK? When I look at the raw values of y_adj for the H row and the previous rows, they are identical: 142.56.
# Revised version: svg-path-parser's makeAbsolute() does the conversion to
# absolute coordinates, and polygons are grouped by their opening moveto
d = imap_dfr(countries, ~{
  message(.y)
  svg_path = xml_find_all(.x, paste0("//*[@id='", .y, "']/d1:path")) %>% html_attr('d')
  node_call = paste0("node -e \"var parseSVG = require('svg-path-parser'); var d='", svg_path,
                     "'; console.log(JSON.stringify(parseSVG.makeAbsolute(parseSVG(d))));\"")
  system(node_call, intern = T) %>% fromJSON %>% mutate(country = .y)
}) %>% as_data_frame() %>%
  mutate(grp = (command == 'moveto') %>% cumsum)
d %>% ggplot(aes(x, -y, group = grp, fill=country)) +
geom_polygon(col='black', size=.3, alpha=.5) +
coord_equal() + guides(fill=F)
I'm not familiar with R, but to me it looks like you divide the path into groups by looking for closepath commands, and then take the first moveto in each group as the starting point from which to accumulate positions for the conversion to absolute. Two sources of error are: 1. moveto commands, apart from the first one, can also be relative (to the last coordinate of the previous group). 2. Groups need not be closed with a closepath command. Searching for the opening moveto would be more reliable. – ccprog
Sep 9 '18 at 15:42
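To illustrate the grouping point in the comment above, here is a small sketch in base R. The command sequence is made up for illustration (it is not taken from the map file), and its second subpath is deliberately left unclosed:

```r
# Toy command sequence (hypothetical): three subpaths, the second never closed.
command <- c("moveto", "lineto", "closepath",
             "moveto", "lineto",
             "moveto", "lineto", "closepath")

# Grouping on closepath: the unclosed second subpath shares a group
# with the third one.
grp_close <- cumsum(command == "closepath")

# Grouping on the opening moveto: one group per subpath.
grp_move <- cumsum(command == "moveto")

print(rbind(grp_close, grp_move))
```

With closepath as the delimiter, the second and third subpaths end up in the same group, whereas counting moveto commands keeps them apart.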
Hi @ccprog. I do use closepath to create the variable grp (which identifies unique polygons), but it does not play any role in parsing the actual coordinates. In fact I just use the SVG relative field, which as I understand it specifies whether coordinates are relative or absolute. With absolute codes you also have to account for H/V commands, which inherit the inactive coordinate from the previous point. – geotheory
Sep 9 '18 at 15:54
1 Answer
Look at your rendering of Canada, especially the southern coast of Hudson Bay. There is a very obvious error. Sifting through the path data, I found the following sequence in the original data:
h-2.28l-.91-.6-.92-.59H188.65l-1.03.02-.74-.07-.75-.07-.74-.07-.74-.06.88 1.09
I've loaded your rendering result into Inkscape, and drawn the relevant part of the path on top, the arrow marking the segment drawn by the absolute H command. (The z command has been removed, that is the reason for the missing segment.) It is obvious that somewhere in there a segment is too long.
It turns out the absolute H corrects the previous (horizontal) error. Look at the preceding point: it is 198., 143., but it should be 191.76,146.07. The vertical error remains at about -3.6.
I've made a codepen that overlays the original path data with your rendering as precisely as possible. The path data have been divided into the (single-polygon) groups and converted to absolute by Inkscape. Unfortunately, the program cannot convert them to polygon primitives, so there are still V and H commands in there.
It shows this:
group0
I've made some visual measurements of that deviation (error ~0.05), and they ultimately give the clue:
group01: 0.44,-0.73
group02: 0.84,-1.12
group03: 2.04,-1.44
group04: 2.94,-1.73
group05: 2.60,-1.86
group06: 3.14,-2.38
group07: 3.68,-2.54
group08: 4.03,-3.35
group09: 4.87,-2.97
group10: 6.08,-3.50 (begin)
group10: 0.00,-3.53 (end)
group11: 1.08,-1.95
group12: 2.05,-2.45
group13: 2.89,-2.84
group14: 3.64,-3.67
group15: 4.48,-3.44
group16: 4.04,-3.99
group17: 4.32,-3.08
group18: 4.75,-2.75
group19: 5.72,-2.95
group20: 5.40,-3.11
group21: 6.02,-2.95
group22: 6.63,-4.14
group23: 6.85,-5.00
group24: 7.14,-4.86
group25: 7.72,-4.39
group26: 8.65,-4.75
group27: 9.49,-4.39
group28: 10.20,-4.44
group29: 11.13,-4.58
You are removing the closepath commands, and then computing the first point of the next group relative to the last explicit point of the previous group. But closepath actually moves the current point: back to the position of the last moveto command. These may, but need not, be identical.
I can't give you a ready script in R, but what you need to do is this: at the beginning of a new group, cache the position of the first point. At the beginning of the next group, compute the new first point relative to that cached point.
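For what it's worth, here is a minimal sketch of that idea in base R. It is not tested against the map file. The helper name to_absolute and the toy data are made up for illustration; the sketch assumes a data frame with the code, x, y and relative columns shown in the question, and it only handles the M, L, H, V and Z commands (no curves).

```r
# Hypothetical helper (not part of svg-path-parser): convert parsed path
# commands to absolute coordinates while honouring closepath.
to_absolute <- function(cmds) {
  n <- nrow(cmds)
  x_abs <- numeric(n)
  y_abs <- numeric(n)
  cur <- c(0, 0)    # current point
  start <- c(0, 0)  # cached first point of the current subpath
  for (i in seq_len(n)) {
    code <- toupper(cmds$code[i])
    rel <- isTRUE(cmds$relative[i])   # NA counts as absolute
    if (code == "Z") {
      cur <- start                    # closepath moves back to the subpath start
    } else {
      x <- cmds$x[i]
      y <- cmds$y[i]
      if (code == "H") {              # horizontal: y is inherited
        cur[1] <- if (rel) cur[1] + x else x
      } else if (code == "V") {       # vertical: x is inherited
        cur[2] <- if (rel) cur[2] + y else y
      } else {                        # M and L
        cur <- if (rel) cur + c(x, y) else c(x, y)
      }
      if (code == "M") start <- cur   # cache the first point of the new subpath
    }
    x_abs[i] <- cur[1]
    y_abs[i] <- cur[2]
  }
  cbind(cmds, x_abs = x_abs, y_abs = y_abs)
}

# Toy path (made up): a closed subpath followed by a relative moveto.
toy <- data.frame(
  code = c("M", "l", "l", "z", "m", "l", "H", "z"),
  x = c(10, 5, 0, NA, 2, 3, 20, NA),
  y = c(10, 0, 5, NA, 2, 0, NA, NA),
  relative = c(FALSE, TRUE, TRUE, NA, TRUE, TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)
res <- to_absolute(toy)
print(res[, c("code", "x_abs", "y_abs")])
```

In the toy path the relative m lands at (12, 12), because closepath has moved the current point back to (10, 10). Dropping the closepath rows before accumulating would instead place it at (17, 17), relative to the last explicit point (15, 15). For plotting, the closepath rows can be filtered out after the conversion rather than before.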
Thanks for the help, ccprog. Good to focus on specifics like this. So it's true I set the NA values for the y variable of H commands to zero. But later I override that with if(d2$code[i] == 'H') d2$y_adj[i] = d2$y_adj[i-1], which basically inherits the previous y value. I've added the relevant data.frame sections for d and d2 to the question. – geotheory
Sep 9 '18 at 17:14
I understand, but I am absolutely sure the culprit is that absolute H command. I've added a screenshot to prove my point.
– ccprog
Sep 9 '18 at 17:42
No, the absolute H corrects the (horizontal) error. Look at the preceding point: it is 198., 143., but it should be 191.76,146.07. The vertical error remains. If, in addition, I account for the vertical error and move your rendering up by dy=-3.6, the very first point of the path data matches. As far as I can judge, all other path groups are internally consistent, but the further down in the path data they are, the more they are off to the bottom left. – ccprog
Sep 9 '18 at 19:08
This is going to take some thinking. Will come back shortly.
– geotheory
Sep 9 '18 at 20:07
I've made a codepen that overlays the original path data with your rendering as precisely as possible. The path data have been divided into the (single-polygon) groups and converted to absolute by Inkscape. Unfortunately, the program cannot convert them to polygon primitives, so there are still V and H commands in there.
– ccprog
Sep 9 '18 at 20:11
I've also submitted this to the svg-path-parser module on github
– geotheory
Sep 9 '18 at 9:57