Ruby space character does not equal space character
In the following statements, one of the spaces comes from a user's input (I copied the character from an ActiveRecord field in a remote Rails console and pasted it here), and the other is from my keyboard. Both statements return false:
" " == " " # => false
" ".include? " " # => false
Any ideas on why/how this might be happening?
UTF-8 or non-printing characters, possibly. Just because it looks like a space doesn’t mean it is. Another reason to never trust user input!
– Todd A. Jacobs
Aug 30 at 2:50
“I figured it must be some sort of encoding issue”: this is by no means an encoding issue. The user typed a non-breaking space. Smart people tune their keyboards nowadays to be able to type typographically correct stuff, like proper “quotes” and ‘apostrophes’ instead of typewriter’s crap. Also spaces, en–dashes, em—dashes, and even hearts ❤.
– Aleksei Matiushkin
Aug 30 at 4:07
@sawa SO parser converts whatever space to the normal ASCII, so there is no way to paste it properly here. Although, the issue is well described.
– Aleksei Matiushkin
Aug 30 at 6:05
@sawa sorry, not whatever space :) only nbsp. Spaces from my answer are preserved. I will change the OP to one of those to make the problem reproducible. → here you go, it’s now reproducible.
– Aleksei Matiushkin
Aug 30 at 6:08
2 Answers
To validate user input for blankness, one should not use the == and/or include? helpers. One should use a modern, Unicode-aware regular expression that matches spaces.
FYI: there are more than ten whitespace characters in UTF-8 specs, including, but not limited to:
# Each value is written as a \u escape so the invisible character survives copy and paste.
spaces = {
  space_medium_mathematical_space: "\u205F",
  spaces_em_quad: "\u2001",
  spaces_em_space: "\u2003",
  spaces_en_quad: "\u2000",
  spaces_en_space: "\u2002",
  spaces_figure_space: "\u2007",
  spaces_four_per_em_space: "\u2005",
  spaces_hair_space: "\u200A",
  spaces_punctuation_space: "\u2008",
  spaces_six_per_em_space: "\u2006",
  spaces_thin_space: "\u2009",
  spaces_three_per_em_space: "\u2004"
}
To match them, one uses the \p{Space} matcher. The plain \s matcher is not enough:
spaces.values.map(&/\A\s*\z/.method(:match?))
#⇒ [false, false, false, false, false, false,
# false, false, false, false, false, false]
But:
spaces.values.map(&/\A\p{Space}*\z/.method(:match?))
#⇒ [true, true, true, true, true, true,
# true, true, true, true, true, true]
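As a minimal sketch of how the advice above could be wrapped up for input validation: the helper name blank_input? and the constant WHITESPACE_ONLY are hypothetical and not part of the original answer, and the code assumes Ruby 2.4 or later for String#match?.

```ruby
# Hypothetical constant: matches a string made only of Unicode whitespace.
WHITESPACE_ONLY = /\A\p{Space}*\z/

# Hypothetical helper: true for nil, "", and whitespace-only input,
# including non-breaking and typographic spaces.
def blank_input?(value)
  value.nil? || value.to_s.match?(WHITESPACE_ONLY)
end

p blank_input?(" ")             # regular space
p blank_input?("\u00A0\u2003")  # non-breaking space followed by an em space
p blank_input?("foo")
```

The first two calls print true and the last prints false, so a field containing only a pasted non-breaking space is treated the same as one containing a keyboard space.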
Nitpick alert: You meant Unicode specs. UTF-8 doesn't say that e.g. 8200 is a space, just that one should use bytes 226, 128 and 136 to represent it.
– Amadan
Aug 30 at 7:54
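The distinction in this comment can be checked from Ruby. The sketch below uses the punctuation space (code point 8200, U+2008) mentioned above; the constant name PUNCTUATION_SPACE is only for illustration.

```ruby
# Illustrative constant: the punctuation space, written as an escape.
PUNCTUATION_SPACE = "\u2008"

# The code point is assigned by Unicode and does not depend on the encoding.
p PUNCTUATION_SPACE.ord                       # => 8200

# UTF-8 only decides which bytes represent that code point.
p PUNCTUATION_SPACE.bytes                     # => [226, 128, 136]

# A different encoding yields different bytes for the same character.
p PUNCTUATION_SPACE.encode("UTF-16BE").bytes  # => [32, 8]
```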
unicode.org/versions/Unicode11.0.0/ch06.pdf (p. 264) contains a table with the available spaces and some description.
– Stefan
Aug 30 at 7:58
@Amadan indeed. FWIW: the spaces snippet above is taken from my Elixir pet project StringNaming, where I parse unicode.org/Public/10.0.0/ucd/NamesList.txt from Unicode specs to get names and values.
– Aleksei Matiushkin
Aug 30 at 8:03
I believe you can use String#unicode_normalize. It supports several normalization forms, which are documented at unicode.org. It seems that :nfkc and :nfkd suit this purpose.
s = "foo\u00A0bar" # <-- includes a non-breaking space (U+00A0)
space = " " # <-- regular space
s.include?(space) # => false
s.unicode_normalize(:nfkc).include?(space) # => true
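One thing worth knowing before adopting this: compatibility normalization rewrites more than spaces. The sketch below compares :nfkc with a narrower alternative that only replaces characters in the Unicode space-separator category (\p{Zs}). The helper ascii_spaces is hypothetical and is a swapped-in technique, not something the answer proposes.

```ruby
# Hypothetical helper: replace every space-separator character (category Zs,
# which includes U+00A0 and the U+2000..U+200A spaces) with a regular space.
# Tabs and newlines are not in Zs, so they are left alone.
def ascii_spaces(str)
  str.gsub(/\p{Zs}/, " ")
end

# Sample input: a non-breaking space plus the single-character "fi" ligature U+FB01.
SAMPLE = "foo\u00A0bar \uFB01le"

# NFKC turns the non-breaking space into U+0020, but also expands the ligature.
p SAMPLE.unicode_normalize(:nfkc)           # => "foo bar file"

# The narrower helper changes only the space and keeps U+FB01 as it was.
p ascii_spaces(SAMPLE).include?("\uFB01")   # => true
p ascii_spaces(SAMPLE).include?("foo bar")  # => true
```

Which one fits depends on whether the rest of the user's text should be stored as typed or folded to its compatibility form.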
Try checking them with [" ".ord, " ".ord]; if you get [32, 160], you have a regular space and a non-breaking space.
– Sebastian Palma
Aug 30 at 2:44
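The check suggested in this comment can be extended to a whole string. This is a small sketch, with the hypothetical helper codepoint_report, that makes look-alike characters visible by printing their code points.

```ruby
# Hypothetical helper: list every character of a string as a U+XXXX code point.
def codepoint_report(str)
  str.each_char.map { |ch| format("U+%04X", ch.ord) }
end

# Illustrative constants, written as escapes so the two cannot be confused.
KEYBOARD_SPACE = "\u0020"
PASTED_SPACE   = "\u00A0"

p [KEYBOARD_SPACE.ord, PASTED_SPACE.ord]  # => [32, 160]
p KEYBOARD_SPACE == PASTED_SPACE          # => false
p codepoint_report("a\u00A0b")            # => ["U+0061", "U+00A0", "U+0062"]
```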