Ruby space character does not equal space character




In the following statements, one of the spaces comes from a user's input (I copied the character from an ActiveRecord field in a remote Rails console and pasted it), and the other comes from my keyboard. Both statements return false:




# "\u00A0" stands in for the space pasted from the user's input
" " == "\u00A0" # => false
" ".include? "\u00A0" # => false



Any ideas on why/how this might be happening?





Try checking them with [" ".ord, " ".ord]; if you get [32, 160], you have a regular space and a non-breaking space.
– Sebastian Palma
Aug 30 at 2:44
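
The check suggested in this comment can be tried without pasting anything by writing the non-breaking space as an escape. This is a minimal sketch; it assumes the user's character is U+00A0, as the [32, 160] result suggests, and the variable names are illustrative only:

```ruby
# Assumption: the user's character is U+00A0 (no-break space).
# Writing it as an escape keeps it intact through copy and paste.
keyboard_space = " "
user_space     = "\u00A0"

p [keyboard_space.ord, user_space.ord]   # => [32, 160]
p keyboard_space == user_space           # => false

# For longer input, list the code point of every character.
p "foo\u00A0bar".codepoints              # => [102, 111, 111, 160, 98, 97, 114]
```

Any value other than 32 in a position that looks like a space points to a different whitespace character.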







UTF-8 or non-printing characters, possibly. Just because it looks like a space doesn’t mean it is. Another reason to never trust user input!
– Todd A. Jacobs
Aug 30 at 2:50





“I figured it must be some sort of encoding issue”: this is by no means an encoding issue. The user typed a non-breaking space. Smart people tune their keyboards nowadays to be able to type typographically correct stuff, like proper “quotes” and ‘apostrophes’ instead of typewriter’s crap. Also spaces, en–dashes, em—dashes, and even hearts ❤.
– Aleksei Matiushkin
Aug 30 at 4:07






@sawa SO parser converts whatever space to the normal ASCII, so there is no way to paste it properly here. Although, the issue is well described.
– Aleksei Matiushkin
Aug 30 at 6:05





@sawa sorry, not whatever space :) only nbsp. Spaces from my answer are preserved. I will change the OP to one of those to make the problem reproducible. → here you go, it’s now reproducible.
– Aleksei Matiushkin
Aug 30 at 6:08





2 Answers



To validate user input for blankness, one should not use the == and/or include? helpers. One should use a modern regular expression that matches spaces.





FYI: there are more than ten whitespace characters in UTF-8 specs, including, but not limited to:


spaces = {
  space_medium_mathematical_space: "\u205F",
  spaces_em_quad: "\u2001",
  spaces_em_space: "\u2003",
  spaces_en_quad: "\u2000",
  spaces_en_space: "\u2002",
  spaces_figure_space: "\u2007",
  spaces_four_per_em_space: "\u2005",
  spaces_hair_space: "\u200A",
  spaces_punctuation_space: "\u2008",
  spaces_six_per_em_space: "\u2006",
  spaces_thin_space: "\u2009",
  spaces_three_per_em_space: "\u2004"
}



To match them, one uses the \p{Space} matcher.


spaces.values.map { |s| s == " " }
#⇒ [false, false, false, false, false, false,
# false, false, false, false, false, false]



But:


spaces.values.map(&/\A\p{Space}*\z/.method(:match?))
#⇒ [true, true, true, true, true, true,
# true, true, true, true, true, true]
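
The same pattern can be wrapped in a small blankness check. This is only a sketch: unicode_blank? is a hypothetical helper name, not part of Ruby core, and it assumes Ruby 2.4 or later for Regexp#match?.

```ruby
# Hypothetical helper (not part of Ruby core): true when the string is
# empty or consists solely of Unicode whitespace characters.
def unicode_blank?(str)
  /\A\p{Space}*\z/.match?(str)
end

p unicode_blank?("\u00A0\u2003 ")  # => true  (no-break space, em space, space)
p unicode_blank?(" x ")            # => false

# String#strip does not remove these characters, so a
# strip-based check would not treat the input as blank.
p "\u00A0".strip.empty?            # => false
```

Anchoring the pattern with \A and \z makes it test the whole string rather than find a space somewhere inside it.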





Nitpick alert: You meant Unicode specs. UTF-8 doesn't say that e.g. 8200 is a space, just that one should use bytes 226, 128 and 136 to represent it.
– Amadan
Aug 30 at 7:54
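
The distinction drawn in this comment can be seen directly in Ruby. A minimal sketch, using the punctuation space (U+2008) from the list above; the variable name is illustrative:

```ruby
punctuation_space = "\u2008"   # PUNCTUATION SPACE, code point 8200

# The code point is assigned by Unicode and does not depend on the encoding.
p punctuation_space.ord                       # => 8200

# UTF-8 only determines which bytes represent that code point.
p punctuation_space.bytes                     # => [226, 128, 136]

# Another encoding yields different bytes for the same character.
p punctuation_space.encode("UTF-16BE").bytes  # => [32, 8]
```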





unicode.org/versions/Unicode11.0.0/ch06.pdf (p. 264) contains a table with the available spaces and some description.
– Stefan
Aug 30 at 7:58





@Amadan indeed. FWIW: the spaces snippet above is taken from my Elixir pet project StringNaming where I parse unicode.org/Public/10.0.0/ucd/NamesList.txt from Unicode specs to get names and values.
– Aleksei Matiushkin
Aug 30 at 8:03






I believe you can use String#unicode_normalize. It supports several normalization forms, which are documented at unicode.org. The :nfkc and :nfkd forms seem to suit this purpose.


s = "foo\u00A0bar" # <-- includes a non-breaking space
space = " " # <-- regular space

s.include?(space) # => false
s.unicode_normalize(:nfkc).include?(space) # => true
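
One thing to keep in mind is that compatibility normalization rewrites more than spaces. The sketch below uses a made-up input string to compare it with replacing whitespace only; which behaviour is preferable depends on the application.

```ruby
# Illustrative input: "ﬁ" ligature (U+FB01), "ne", no-break space, "print".
input = "\uFB01ne\u00A0print"

# NFKC turns the no-break space into U+0020, but also expands the ligature.
p input.unicode_normalize(:nfkc) == "fine print"        # => true

# Replacing only whitespace leaves every other character as typed.
p input.gsub(/\p{Space}/, " ") == "\uFB01ne print"      # => true
```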


