Slice a string containing Unicode chars









I have a piece of text with characters of different byte lengths.

let text = "Hello привет";

I need to take a slice of the string given start (inclusive) and end (exclusive) character indices. I tried this:

let slice = &text[start..end];

and got the following error:

thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`'

I suppose this happens because Cyrillic letters are multi-byte and the [..] notation slices using byte indices. What can I use if I want to slice using character indices, like I do in Python?

slice = text[start:end]

I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?










  • I think chars() is the way to go here: text.chars().take(end).skip(start)
    – Tim Diekmann
    Aug 23 at 10:10

  • @TimDiekmann how do I convert the Take<Chars> to &str then if the API needs it?
    – Sasha Tsukanov
    Aug 23 at 10:17

  • You should call collect(). See this question stackoverflow.com/questions/37157926/…
    – ozkriff
    Aug 23 at 10:18

  • @ozkriff collect() will result in a String, not a &str. This is why I didn't mark this as a duplicate of your linked question.
    – Tim Diekmann
    Aug 23 at 10:31














string unicode rust slice






asked Aug 23 at 9:52 by Sasha Tsukanov, edited Aug 23 at 19:16 by Matthieu M.







2 Answers
Possible solutions to codepoint slicing




I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?




If you know the exact byte indices, you can slice a string:



let text = "Hello привет";
println!("{}", &text[2..10]);


This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):



let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
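
The parenthetical alternative mentioned above, chars() combined with char::len_utf8(), could look roughly like this. It is only a sketch: the helper names byte_offset and codepoint_slice are made up for illustration and are not part of the standard library. Summing the encoded lengths means that an index equal to (or larger than) the number of codepoints maps to text.len().

```rust
/// Hypothetical helper (not part of std): byte offset of the `n`-th codepoint.
/// If `n` is at least the number of codepoints, this returns `text.len()`.
fn byte_offset(text: &str, n: usize) -> usize {
    // Each char knows how many bytes its UTF-8 encoding occupies.
    text.chars().take(n).map(char::len_utf8).sum()
}

/// Hypothetical helper: slice by codepoint indices,
/// `start` inclusive and `end` exclusive. Panics if `start > end`.
fn codepoint_slice(text: &str, start: usize, end: usize) -> &str {
    &text[byte_offset(text, start)..byte_offset(text, end)]
}
```

With the text from the question, codepoint_slice(text, 2, 8) gives the same "llo пр" as the byte slice &text[2..10]. Both helpers still walk the string from the beginning, so the cost stays O(n).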


As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.



let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());


Why is this not easier?



As you can see, neither of these solutions is all that great. This is intentional, for two reasons:



As str is simply a UTF-8 buffer, indexing by Unicode codepoints is an O(n) operation. Usually, people expect the indexing operator to be an O(1) operation. Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).



But the more important reason:



Unicode codepoints are generally not a useful unit



What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices Unicode codepoints. This is essentially what a Rust char represents (strictly, a char is a Unicode scalar value, which excludes surrogate codepoints). It is 32 bits wide (fewer bits would suffice, but the size is rounded up to a power of two).

But what you actually want to do is slice user-perceived characters. That is an explicitly loosely defined term: different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more Unicode codepoints. Consider this Python 3 code:



>>> s = "Jürgen"
>>> s[0:2]
'Ju'


Surprising, right? This is because the string above is:




  • 0x004A LATIN CAPITAL LETTER J


  • 0x0075 LATIN SMALL LETTER U


  • 0x0308 COMBINING DIAERESIS

  • ...

This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
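
The same effect can be reproduced in Rust with only the standard library. This is a small sketch; the function name take_codepoints and the constant JUERGEN are made up for illustration, and the string is written with an explicit \u{308} escape so the combining mark is visible in the source.

```rust
/// Hypothetical helper: the first `n` codepoints of `s` as a new String.
fn take_codepoints(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// "Jürgen" spelled with `u` followed by U+0308 COMBINING DIAERESIS.
const JUERGEN: &str = "Ju\u{308}rgen";
```

Here take_codepoints(JUERGEN, 2) yields "Ju" without the diaeresis, and JUERGEN.chars().count() is 7 although a reader sees six characters.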



Another example:

>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'

Also not what you'd expect. This time, the "fi" is actually the single codepoint U+FB01 LATIN SMALL LIGATURE FI (ﬁ), so the two-codepoint slice shows what looks like three letters.



There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.



So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.




Further resources on this topic:



  • Blogpost "Let's stop ascribing meaning to unicode codepoints"

  • Blogpost "Breaking our Latin-1 assumptions"

  • http://utf8everywhere.org/





  • To make let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap(); work when we want to slice up to the last codepoint in the string (say, with index 11) by using, say, 12 as the excluded bound, we need more work. One can add something like let end = if end_codepoint_idx == text.chars().count() { text.len() } else { text.char_indices().map(|(i, _)| i).nth(end_codepoint_idx).unwrap() };
    – Sasha Tsukanov
    Aug 23 at 13:58
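
One way to cover the case raised in this comment without a separate branch is to append text.len() to the sequence of codepoint boundaries, so that the codepoint count itself becomes a valid exclusive bound. A minimal sketch, assuming a hypothetical function name end_byte_index:

```rust
use std::iter::once;

/// Hypothetical helper: byte index for an exclusive end bound given in codepoints.
/// Returns None if `end_codepoint_idx` is greater than the number of codepoints.
fn end_byte_index(text: &str, end_codepoint_idx: usize) -> Option<usize> {
    text.char_indices()
        .map(|(i, _)| i)
        // One extra boundary: the position just past the last codepoint.
        .chain(once(text.len()))
        .nth(end_codepoint_idx)
}
```

Returning an Option instead of calling unwrap() leaves it to the caller to decide what an out-of-range bound should mean.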

















A UTF-8 encoded string may contain characters that consist of multiple bytes. In your case, п starts at byte index 6 (inclusive) and ends at byte index 8 (exclusive), so index 7 is not the start of a character. This is why your error occurred.

You may use str::char_indices() to solve this (remember that getting to a position in UTF-8 is O(n)):



fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    string.char_indices().nth(start).and_then(
    // … (the rest of the function is cut off in the source)
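
The function above is cut off in the source. The following is one possible completion, offered as a sketch and not as the answer author's original code: everything after and_then( is an assumption. In this version the result is None when start is not the index of an existing codepoint or when end lies beyond the end of the string.

```rust
/// A possible completion (assumed, not the original body): the substring
/// between codepoint indices `start` (inclusive) and `end` (exclusive).
fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    string.char_indices().nth(start).and_then(|(start_pos, _)| {
        let rest = &string[start_pos..];
        match rest.char_indices().nth(end - start) {
            // `end` points at a codepoint inside the string.
            Some((len, _)) => Some(&rest[..len]),
            // `end` equals the codepoint count: slice to the end of the string.
            None if rest.chars().count() == end - start => Some(rest),
            // `end` lies beyond the end of the string.
            None => None,
        }
    })
}
```

Because the result borrows from the input, no String is allocated, which is what the comment thread under the question asked for.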



You may use str::chars() if you are fine with getting a String:



let string: String = text.chars().take(end).skip(start).collect();





Accepted answer: answered Aug 23 at 10:23 by Lukas Kalbertodt, edited Aug 23 at 10:37











        edited Aug 23 at 10:37

























        answered Aug 23 at 10:26









        Tim Diekmann



























