Removing duplicate characters from two argument strings in C

Removing duplicate characters from two argument strings in C



I'm trying to optimize a problem that I have to make it more readable with the same speed optimization. My problem consists in this:



Allowed function: write.c, nothing else.



Write a program that takes two strings and displays, without doubles, the
characters that appear in either one of the strings.



The display will be in the order characters appear in the command line, and
will be followed by a n.



As you can see, in the main it will take two of your argument strings (argv[1] and argv[2]) into our function (void remove_dup(char *str, char *str2) after is compiling it with GCC. That temporary array will hold the ASCII value of the character after a duplicate is detected. For example, str1 = "hello" and str2 = "laoblc". The expected output will result as "heloabc" using the write function.


argv[1]


argv[2]


void remove_dup(char *str, char *str2)


str1 = "hello"


str2 = "laoblc"



However, GCC was complaining because I have an array subscript with my temporary character array filled in with zeroes from the index of my strings. To stop making the compiler complaint, I had to cast the string index as an int to hold the ASCII value inside my temporary array. This will be our checker, which will determine if there is a duplicate in our string depending on the value of the character. Recompiling it again, but this time using warning flags: gcc -Wextra -Werror -Wall remove_dup.c. This is the error that I get:


gcc -Wextra -Werror -Wall remove_dup.c



remove_dup:11 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]


if (temp[str[i]] == 0)
^~~~~~~



remove_dup:13 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]


temp[str[i]] = 1;
^~~~~~~



remove_dup:21 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]


if (temp[str2[i]] == 0)
^~~~~~~~



remove_dup.c:23 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]


temp[str2[i]] = 1;
^~~~~~~~



Now my real question is, how can I have the same time efficiency BUT without using any kind of casting into my array? This program is running as O(m + n) where m is our first string and n is our second string.


O(m + n)


m


n



This is the code:


void remove_dup(char *str, char *str2)

int temp[10000] = 0;
int i;

i = 0;
while (str[i])

if (temp[(int)str[i]] == 0)

temp[(int)str[i]] = 1;
write(1, &str[i], 1);

i++;

i = 0;
while (str2[i])

if (temp[(int)str2[i]] == 0)

temp[(int)str2[i]] = 1;
write(1, &str2[i], 1);

i++;



int main(int argc, char *argv)

if (argc == 3)
remove_dup(argv[1], argv[2]);
write(1, "n", 1);
return (0);



I hope this is clear enough with the logic structure I explained. I might have grammar mistakes, so bear with me :).






AFAIK casting does not incur a runtime penalty. On a related note, temp only needs to be 256 elements long because that's how large char is.

– Schwern
Sep 17 '18 at 0:18



temp


char






I agree with Schwen. Your code looks good. A slightly more elegant way to remove the annoying gcc warning is to replace temp[(int)str[i]] with temp[+str[i]]. This will also avoid the explicit cast; it is best to avoid casts whenever possible.

– Joseph Quinsey
Sep 17 '18 at 0:30


temp[(int)str[i]]


temp[+str[i]]






Correction: Unless your chars are signed!

– Joseph Quinsey
Sep 17 '18 at 0:33


signed






@JosephQuinsey can you explain why chars must be signed in this case?

– Zeid Tisnes
Sep 17 '18 at 0:50





2 Answers
2



Casting here will have no performance penalty.



However, as a rule of thumb, it is generally best to avoid explicit casts whenever possible. You can do this by for example by changing:


temp[(int)str[i]]



to:


temp[+str[i]]



This will work by the usual arithmetic conversions.



However, your code has another problem. You could ask: why would gcc bother to issue such an annoying warning message?



One answer is that they just like to be annoying. A better guess is that on most platforms char is signed-- see Is char signed or unsigned by default? --and so if your string happen to have an ASCII char greater than 127 (i.e. less than zero), you will have a bug.


char


signed



One way to fix this is to replace:


temp[(int)str[i]]



with:


temp[str[i] + 128]



(and change int temp[10000] = 0 to int temp[256 + 128] = 0). This will work regardless of the default sign of char.


int temp[10000] = 0


int temp[256 + 128] = 0


char






As I was testing this, it will not complain. However, it will print both strings with the duplicates. if I make a change for temp[str[i] + 128] to temp[+str[i]] it will make it work.

– Zeid Tisnes
Sep 17 '18 at 1:38


temp[str[i] + 128]


temp[+str[i]]






Did you change the two instances of temp[(int)str[i]] and the two instances of temp[(int)str2[i]]?

– Joseph Quinsey
Sep 17 '18 at 1:41


temp[(int)str[i]]


temp[(int)str2[i]]






Sorry if the above comment is obvious--I make dumb mistakes sometimes.

– Joseph Quinsey
Sep 17 '18 at 1:44






Edit: Regarding the OP's comment "This program is running as O(m * n) where m is our first string and n is our second string.", the time is in fact O(m + n).

– Joseph Quinsey
Sep 17 '18 at 1:53


O(m * n)


m


n


O(m + n)






No worries, we all do to be honest. It works fine now after setting both instances to temp[str[i] + 128] or temp[+str[i]]. I thought to have one instance changed into temp[+str[i]] and the other to temp[str[i] + 128] will make the same asnwer for both. Edit: Yes, you are right in the O(m + n). I will change it.

– Zeid Tisnes
Sep 17 '18 at 1:53



temp[str[i] + 128]


temp[+str[i]]


temp[+str[i]]


temp[str[i] + 128]


O(m + n)



Now my real question is, how can I have the same time efficiency BUT without using any kind of casting into my array?



I don't believe casting in C has a runtime penalty. Everything in C is a number anyway. I believe it's just telling the compiler that yes, you know you're using the wrong type and believe it's ok.



Note that char can be signed. It is possible for a negative number to sneak in there.


char



This program is running as O(m * n) where m is our first string and n is our second string.



No, it's running as O(n). O(m*n) would be if you were iterating over one string for every character of the other.


for( int i = 0; i < strlen(str1); i++ )
for( int j = 0; j < strlen(str2); j++ )
...




But you're looping over each string one after the other in two independent loops. This is O(m + n) which is O(n).



On to improvements. First, temp only ever needs to hold the char range which is, at most, 256. Let's give it a variable name that describes what it does, chars_seen.


temp


char


256


chars_seen



Finally, there's no need to store a full integer. Normally we'd use bool from stdbool.h, but we can define our own using signed char which is what stdbool.h is likely to do. We're sure to wrap it in an #ifndef bool so we use the system supplied one if available, it will know better than we do what type to use for a boolean.


bool


stdbool.h


signed char


stdbool.h


#ifndef bool


#ifndef bool
typedef signed char bool;
#endif
bool chars_seen[256] = 0;



You might be able to get a bit more performance by eliminating i and instead increment the pointer directly. Not only more performance, but this makes many string and array operations simpler.


i


for( ; *str != ''; str++ )
if( !chars_seen[(size_t)*str] )
chars_seen[(size_t)*str] = 1;
write(1, str, 1);




Note that I'm casting to size_t, not int, because that is the proper type for an index.


size_t


int



You might be able to shave a touch off by using post-increment, whether this helps is going to depend on your compiler.


if( !chars_seen[(size_t)*str]++ )
write(1, str, 1);



Finally, to avoid repeating your code and to extend it to work with any number of strings, we can write a function which takes in the set of characters seen and displays one string. And we'll give the compiler the hint to inline it, though it's of questionable use.


inline void display_chars_no_dups( const char *str, bool chars_seen)
for( ; *str != ''; str++ )
if( !chars_seen[(size_t)*str]++ )
write(1, str, 1);





Then main allocates the array of seen characters and calls the function as many times as necessary.


main


int main(int argc, char *argv)
bool chars_seen[256] = 0;

for( int i = 1; i < argc; i++ )
display_chars_no_dups( argv[i], chars_seen );

write(1, "n", 1);



Thanks for contributing an answer to Stack Overflow!



But avoid



To learn more, see our tips on writing great answers.



Required, but never shown



Required, but never shown




By clicking "Post Your Answer", you agree to our terms of service, privacy policy and cookie policy

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

Edmonton

Crossroads (UK TV series)