How best to compare two compiled binaries? [closed]
I recently discovered an excellent Visual Studio extension which finds unnecessary #include statements in projects and removes them. I work on some gnarly legacy code and it's stripped a huge amount away. The only problem is that I can't be sure that it hasn't altered the build in some subtle way. A project may still build even though a #define somewhere has been altered.
It's occurred to me that I could be sure that no important changes have been made by checking the binaries. Does anyone have advice on how best to do this? The obvious problem is that a small amount of metadata in the binaries will change anyway, such as the build times recorded by the compiler.
Ideas so far:
- Disassemble all the binaries and compare the disassembly with diff. (Although this won't cover the data sections, I guess.)
- Use some kind of binary diff program that's aware of PE headers.
Any ideas? And does anyone know of a tool which understands PE headers as I describe?
c++ c windows
closed as off-topic by chux, dbush, SergeyA, Neil Butterworth, 1201ProgramAlarm Nov 8 at 21:18
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it." – dbush, SergeyA, Neil Butterworth, 1201ProgramAlarm
5
"And anyone know of a good binary diff program like I describe?" At almost 20k rep you should understand that you're walking dangerously close to off-topic-ness here. :)
– HolyBlackCat
Nov 8 at 19:58
Yeh, I guess that's true. But this is the kind of question only fellow programmers will likely know the answer to.
– Benj
Nov 8 at 20:00
"no important changes have been made by checking the binaries" Given that you've modified "gnarly legacy code" and "it's stripped a huge amount away", you've made significant changes. What kind of testing do you do? Because you have to redo ALL of it now.
– Andrew Henle
Nov 8 at 20:00
3
Some compilers tend to be non-deterministic: even the same input code is not guaranteed to generate the same output. Checking semantic equality of binaries is a "hard" problem. You need to rely on your test cases to be sure that nothing has broken.
– Ajay Brahmakshatriya
Nov 8 at 20:08
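One way to check the determinism concern before comparing old against new: build the unchanged source twice and see whether the two outputs are byte-identical. A minimal sketch, assuming the two build outputs are available as files; `file_digest` and `builds_match` are hypothetical helper names, not part of any tool mentioned here.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def builds_match(path_a, path_b):
    """True if the two files have identical contents."""
    return file_digest(path_a) == file_digest(path_b)
```

If two builds of the same source already differ, any byte-level comparison of the old and new binaries has to filter out that baseline noise first.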
1
You don't need to disassemble the binary; you can generate assembly using the -S option in gcc and clang. I remember cl has the /FA flag. Be careful about the line numbers and other debug information, though. You can strip them from the output to retain only the instructions.
– Ajay Brahmakshatriya
Nov 8 at 20:18
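The stripping step suggested in this comment could look roughly like the sketch below. It assumes the listings use `;` for comments (as in cl's /FA output; pass `comment_char="#"` for other syntaxes as needed), and the `NOISE_PREFIXES` list is an illustrative guess rather than a complete set for any compiler. All function names are hypothetical.

```python
import difflib

# Directives that carry file names, line numbers or compiler identity rather
# than instructions. This list is an assumption for illustration; extend it
# for your toolchain.
NOISE_PREFIXES = (".file", ".loc", ".ident", "TITLE", "INCLUDELIB")

def normalize_listing(text, comment_char=";"):
    """Return the listing's lines with comments, blanks and noise removed."""
    kept = []
    for line in text.splitlines():
        # Naive comment stripping: this also cuts a comment character that
        # appears inside a string literal, which is acceptable for a sketch.
        code = line.split(comment_char, 1)[0].strip()
        if not code or code.startswith(NOISE_PREFIXES):
            continue
        kept.append(" ".join(code.split()))  # collapse runs of whitespace
    return kept

def diff_listings(old_text, new_text, comment_char=";"):
    """Unified diff of two normalized listings; an empty list means no change."""
    old = normalize_listing(old_text, comment_char)
    new = normalize_listing(new_text, comment_char)
    return list(difflib.unified_diff(old, new, "before", "after", lineterm=""))
```

Reading two listing files (say `before.asm` and `after.asm`, hypothetical names) and passing their contents to `diff_listings` returns an empty list when only comments, line markers and whitespace differ.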
edited Nov 8 at 20:06
asked Nov 8 at 19:55
Benj
19.7k
1 Answer
The PE headers always sit at the start of the file and are short: often 512 bytes, although the exact size is recorded in the SizeOfHeaders field of the optional header and can be larger (for example 1024 bytes).
So just cut off the headers (512 bytes in the commands below) and compare what remains.
I pipe the files through xxd to convert them to hex, then I diff the resulting text files (any text diff program will work, but on Windows you need the Git command line to get xxd).
xxd -p -c 4 < Truncatedfile1.exe > output1.hex
or
tail -c +513 < File1.exe | xxd -p -c 4 > output1.hex
tail -c +513 < File2.exe | xxd -p -c 4 > output2.hex
git diff --no-index --color output1.hex output2.hex
Note that tail -c +513 starts output at byte 513, i.e. it skips the first 512 bytes. I made the lines just 4 bytes long so that, when an odd number of bytes is inserted somewhere, alignment padding (especially common in data sections) gives the diff a chance to bring the lines back in step. If you are extra lucky, your code is also DWORD-aligned, and then it works for your code just as well.
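The same idea can be sketched in Python without hard-coding the header length. This is an illustration under assumptions, not the method used above: it reads SizeOfHeaders from the PE optional header (offset 60 within it) and then diffs 4-byte hex lines with `difflib`, which is only practical for fairly small files. The function names are hypothetical.

```python
import difflib
import struct

def size_of_headers(data):
    """Read SizeOfHeaders from the PE optional header of an image in `data`."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE file")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # 4-byte signature + 20-byte COFF header, then the optional header;
    # SizeOfHeaders is at offset 60 in both PE32 and PE32+.
    (size,) = struct.unpack_from("<I", data, pe_offset + 24 + 60)
    return size

def hex_lines(data, skip, width=4):
    """Hex-encode data[skip:] as lines of `width` bytes, like `xxd -p -c 4`."""
    body = data[skip:]
    return [body[i:i + width].hex() for i in range(0, len(body), width)]

def diff_bodies(old, new, width=4):
    """Unified diff of everything after the headers; empty list means equal."""
    a = hex_lines(old, size_of_headers(old), width)
    b = hex_lines(new, size_of_headers(new), width)
    return list(difflib.unified_diff(a, b, "before", "after", lineterm=""))
```

Note that skipping the headers removes the link timestamp stored there, but other build-dependent data elsewhere in the file (debug information, for instance) can still show up in the diff.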
Thanks! So it sounds like this is a method that's worked for you? Some in the comments are saying that compilers are too non-deterministic for this idea to work.
– Benj
Nov 8 at 20:58
Well, I do edit binaries with automated tools at the byte level, and as a software tester I have some experience in checking for differences. I just thought about how the data sections could be diffed and remembered that they are usually DWORD-aligned, so I thought that xxd with 4 bytes per line would be a good idea for re-compiled programs as well.
– Ohnemichel
Nov 8 at 21:13
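To see which sections differ rather than diffing the whole file, the section table can be walked directly. The following is a rough sketch, assuming a well-formed PE image with unique section names; `read_pe_sections` and `differing_sections` are hypothetical names, and the layout constants (20-byte COFF header, 40-byte section headers) come from the PE format.

```python
import struct

def read_pe_sections(data):
    """Map section name -> raw bytes for a PE image held in `data`."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE file")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # COFF file header follows the 4-byte signature.
    _machine, n_sections, _stamp, _symtab, _nsyms, opt_size, _chars = \
        struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    table = pe_offset + 24 + opt_size  # section table follows optional header
    sections = {}
    for i in range(n_sections):
        name, _vsize, _vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<8sIIII", data, table + 40 * i)
        key = name.rstrip(b"\0").decode("ascii", "replace")
        sections[key] = data[raw_ptr:raw_ptr + raw_size]
    return sections

def differing_sections(old, new):
    """Sorted names of sections whose raw bytes differ or exist in only one image."""
    a, b = read_pe_sections(old), read_pe_sections(new)
    return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))
```

An empty result means every section's raw bytes match, even if header fields such as the timestamp differ; a result like `[".rdata"]` narrows down where to look with a hex or disassembly diff.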
edited Nov 8 at 20:22
answered Nov 8 at 20:16
Ohnemichel
165