Removing duplicate characters from two argument strings in C
Removing duplicate characters from two argument strings in C
I'm trying to optimize a problem that I have to make it more readable with the same speed optimization. My problem consists in this:
Allowed function: write.c, nothing else.
Write a program that takes two strings and displays, without doubles, the
characters that appear in either one of the strings.
The display will be in the order characters appear in the command line, and
will be followed by a n.
As you can see, in the main it will take two of your argument strings (argv[1]
and argv[2]
) into our function (void remove_dup(char *str, char *str2)
after is compiling it with GCC. That temporary array will hold the ASCII value of the character after a duplicate is detected. For example, str1 = "hello"
and str2 = "laoblc"
. The expected output will result as "heloabc" using the write function.
argv[1]
argv[2]
void remove_dup(char *str, char *str2)
str1 = "hello"
str2 = "laoblc"
However, GCC was complaining because I have an array subscript with my temporary character array filled in with zeroes from the index of my strings. To stop making the compiler complaint, I had to cast the string index as an int to hold the ASCII value inside my temporary array. This will be our checker, which will determine if there is a duplicate in our string depending on the value of the character. Recompiling it again, but this time using warning flags: gcc -Wextra -Werror -Wall remove_dup.c
. This is the error that I get:
gcc -Wextra -Werror -Wall remove_dup.c
remove_dup:11 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]
if (temp[str[i]] == 0)
^~~~~~~
remove_dup:13 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]
temp[str[i]] = 1;
^~~~~~~
remove_dup:21 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]
if (temp[str2[i]] == 0)
^~~~~~~~
remove_dup.c:23 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]
temp[str2[i]] = 1;
^~~~~~~~
Now my real question is, how can I have the same time efficiency BUT without using any kind of casting into my array? This program is running as O(m + n)
where m
is our first string and n
is our second string.
O(m + n)
m
n
This is the code:
void remove_dup(char *str, char *str2)
int temp[10000] = 0;
int i;
i = 0;
while (str[i])
if (temp[(int)str[i]] == 0)
temp[(int)str[i]] = 1;
write(1, &str[i], 1);
i++;
i = 0;
while (str2[i])
if (temp[(int)str2[i]] == 0)
temp[(int)str2[i]] = 1;
write(1, &str2[i], 1);
i++;
int main(int argc, char *argv)
if (argc == 3)
remove_dup(argv[1], argv[2]);
write(1, "n", 1);
return (0);
I hope this is clear enough with the logic structure I explained. I might have grammar mistakes, so bear with me :).
temp
char
I agree with Schwen. Your code looks good. A slightly more elegant way to remove the annoying gcc warning is to replace
temp[(int)str[i]]
with temp[+str[i]]
. This will also avoid the explicit cast; it is best to avoid casts whenever possible.– Joseph Quinsey
Sep 17 '18 at 0:30
temp[(int)str[i]]
temp[+str[i]]
Correction: Unless your chars are
signed
!– Joseph Quinsey
Sep 17 '18 at 0:33
signed
@JosephQuinsey can you explain why chars must be signed in this case?
– Zeid Tisnes
Sep 17 '18 at 0:50
2 Answers
2
Casting here will have no performance penalty.
However, as a rule of thumb, it is generally best to avoid explicit casts whenever possible. You can do this by for example by changing:
temp[(int)str[i]]
to:
temp[+str[i]]
This will work by the usual arithmetic conversions.
However, your code has another problem. You could ask: why would gcc bother to issue such an annoying warning message?
One answer is that they just like to be annoying. A better guess is that on most platforms char
is signed
-- see Is char signed or unsigned by default? --and so if your string happen to have an ASCII char greater than 127 (i.e. less than zero), you will have a bug.
char
signed
One way to fix this is to replace:
temp[(int)str[i]]
with:
temp[str[i] + 128]
(and change int temp[10000] = 0
to int temp[256 + 128] = 0
). This will work regardless of the default sign of char
.
int temp[10000] = 0
int temp[256 + 128] = 0
char
As I was testing this, it will not complain. However, it will print both strings with the duplicates. if I make a change for
temp[str[i] + 128]
to temp[+str[i]]
it will make it work.– Zeid Tisnes
Sep 17 '18 at 1:38
temp[str[i] + 128]
temp[+str[i]]
Did you change the two instances of
temp[(int)str[i]]
and the two instances of temp[(int)str2[i]]
?– Joseph Quinsey
Sep 17 '18 at 1:41
temp[(int)str[i]]
temp[(int)str2[i]]
Sorry if the above comment is obvious--I make dumb mistakes sometimes.
– Joseph Quinsey
Sep 17 '18 at 1:44
Edit: Regarding the OP's comment "This program is running as
O(m * n)
where m
is our first string and n
is our second string.", the time is in fact O(m + n)
.– Joseph Quinsey
Sep 17 '18 at 1:53
O(m * n)
m
n
O(m + n)
No worries, we all do to be honest. It works fine now after setting both instances to
temp[str[i] + 128]
or temp[+str[i]]
. I thought to have one instance changed into temp[+str[i]]
and the other to temp[str[i] + 128]
will make the same asnwer for both. Edit: Yes, you are right in the O(m + n)
. I will change it.– Zeid Tisnes
Sep 17 '18 at 1:53
temp[str[i] + 128]
temp[+str[i]]
temp[+str[i]]
temp[str[i] + 128]
O(m + n)
Now my real question is, how can I have the same time efficiency BUT without using any kind of casting into my array?
I don't believe casting in C has a runtime penalty. Everything in C is a number anyway. I believe it's just telling the compiler that yes, you know you're using the wrong type and believe it's ok.
Note that char
can be signed. It is possible for a negative number to sneak in there.
char
This program is running as O(m * n) where m is our first string and n is our second string.
No, it's running as O(n). O(m*n) would be if you were iterating over one string for every character of the other.
for( int i = 0; i < strlen(str1); i++ )
for( int j = 0; j < strlen(str2); j++ )
...
But you're looping over each string one after the other in two independent loops. This is O(m + n) which is O(n).
On to improvements. First, temp
only ever needs to hold the char
range which is, at most, 256
. Let's give it a variable name that describes what it does, chars_seen
.
temp
char
256
chars_seen
Finally, there's no need to store a full integer. Normally we'd use bool
from stdbool.h
, but we can define our own using signed char
which is what stdbool.h
is likely to do. We're sure to wrap it in an #ifndef bool
so we use the system supplied one if available, it will know better than we do what type to use for a boolean.
bool
stdbool.h
signed char
stdbool.h
#ifndef bool
#ifndef bool
typedef signed char bool;
#endif
bool chars_seen[256] = 0;
You might be able to get a bit more performance by eliminating i
and instead increment the pointer directly. Not only more performance, but this makes many string and array operations simpler.
i
for( ; *str != ''; str++ )
if( !chars_seen[(size_t)*str] )
chars_seen[(size_t)*str] = 1;
write(1, str, 1);
Note that I'm casting to size_t
, not int
, because that is the proper type for an index.
size_t
int
You might be able to shave a touch off by using post-increment, whether this helps is going to depend on your compiler.
if( !chars_seen[(size_t)*str]++ )
write(1, str, 1);
Finally, to avoid repeating your code and to extend it to work with any number of strings, we can write a function which takes in the set of characters seen and displays one string. And we'll give the compiler the hint to inline it, though it's of questionable use.
inline void display_chars_no_dups( const char *str, bool chars_seen)
for( ; *str != ''; str++ )
if( !chars_seen[(size_t)*str]++ )
write(1, str, 1);
Then main
allocates the array of seen characters and calls the function as many times as necessary.
main
int main(int argc, char *argv)
bool chars_seen[256] = 0;
for( int i = 1; i < argc; i++ )
display_chars_no_dups( argv[i], chars_seen );
write(1, "n", 1);
Thanks for contributing an answer to Stack Overflow!
But avoid …
To learn more, see our tips on writing great answers.
Required, but never shown
Required, but never shown
By clicking "Post Your Answer", you agree to our terms of service, privacy policy and cookie policy
AFAIK casting does not incur a runtime penalty. On a related note,
temp
only needs to be 256 elements long because that's how largechar
is.– Schwern
Sep 17 '18 at 0:18