#55 failure parsing archive metadata and crash (sigsegv) building DBus on Mac
Closed 2 months ago by RJVB. Opened 2 months ago by RJVB.

The DBus (1.14.6) build with link-time optimisation (LTO) fails on Mac with a segmentation violation in slibtool when creating libdbus-testutils.la, specifically just after the creation of the actual archive, libdbus-testutils.a .

On a number of occasions I'm seeing

Feb 11 22:53:06 Portia rlibtool[66047]: rlibtool(66047,0x7fff78f6c310) malloc: *** error for object 0x7fe7b9e00aa0: pointer being freed was not allocated
    *** set a breakpoint in malloc_error_break to debug

in my system.log but that disappears when I set MallocErrorAbort=1. The backtrace doesn't change though:

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: EXC_I386_GPFLT

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   rlibtool                        0x0000000100e34943 slbt_get_driver_fdctx + 35 (slbt_driver_ctx.c:1534)
1   rlibtool                        0x0000000100e2bb89 slbt_driver_fdcwd + 25 (slibtool_driver_impl.h:272)
2   rlibtool                        0x0000000100e2b54c slbt_store_archive + 268 (slbt_archive_store.c:73)
3   rlibtool                        0x0000000100e39ebf slbt_archive_import_impl + 319 (slbt_archive_import.c:46)
4   rlibtool                        0x0000000100e39d36 slbt_archive_import + 86 (slbt_archive_import.c:65)
5   rlibtool                        0x0000000100e4aff6 slbt_exec_link_create_archive + 1542 (slbt_exec_link.c:1398)
6   rlibtool                        0x0000000100e49902 slbt_exec_link + 2178 (slbt_exec_link.c:2034)
7   rlibtool                        0x0000000100e2f145 slbt_perform_driver_actions + 229 (slbt_amain.c:85)
8   rlibtool                        0x0000000100e2ed57 slbt_main + 1175 (slbt_amain.c:199)
9   rlibtool                        0x0000000100e5d5ce main + 46 (slibtool.c:14)
10  libdyld.dylib                   0x00007fff939285fd start + 1

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0xaffffffffffffd80  rbx: 0x00007ff192403b40  rcx: 0x0000000000000000  rdx: 0x00007fff5edd8ef0
  rdi: 0xb000000000000000  rsi: 0x00007fff5edd8740  rbp: 0x00007fff5edd8720  rsp: 0x00007fff5edd8700
   r8: 0x00007ff192500000   r9: 0x000000008e68dd72  r10: 0x000000001f8d6585  r11: 0xfffffffffffffa50
  r12: 0x0000000000000000  r13: 0x0000000000000000  r14: 0x0000000100e5dbb8  r15: 0x0000000000000000
  rip: 0x0000000100e34943  rfl: 0x0000000000010246  cr2: 0x00000001010cb000

Logical CPU:     0
Error Code:      0x00000000
Trap Number:     13

It looks like the "in-memory archive merging facility" in slbt_archive_import_impl might not be able to handle archives generated by llvm-ar that contain LLVM bitcode objects? Not sure if that's a proper diagnosis because the attached log shows that llvm-ar does the actual work. Plus, we're creating an archive from scratch so I don't know what's being merged here. Either way, the most logical explanation is that the slbt_archive_ctx *arctx is NULL (impossible because there are no checks against that?) or invalid/stale.

A quick look at slbt_merge_archives suggests that the call to slbt_merge_archives_fail at its very end could result in arctx == NULL without an error being returned!

dbus-build.log


Bingo, something indeed goes wrong trying to get the archive meta data(?). This is what I see when I insert a return in from of that last slbt_merge_archives_fail call :

:info:build rlibtool: link: /opt/local/bin/llvm-ar-mp-9.0 -crs .libs/libdbus-testutils.a .libs/disable-crash-handling.o .libs/test-utils.o
:info:build bash(76718,0x7fff78f6c310) malloc: enabling abort() on bad malloc or free
:info:build xcrun(76718,0x7fff78f6c310) malloc: enabling abort() on bad malloc or free
:info:build llvm-ar(76718,0x7fff78f6c310) malloc: enabling abort() on bad malloc or free
:info:build rlibtool: error logged in slbt_get_archive_meta(), line 405: flow error: unexpected condition or other.
:info:build rlibtool: < returned to > slbt_merge_archives(), line 683.
:info:build rlibtool: < returned to > slbt_archive_import_impl(), line 50.
:info:build rlibtool: < returned to > slbt_exec_link_create_archive(), line 1399.
:info:build rlibtool: < returned to > slbt_exec_link(), line 2039.
:info:build make[3]: *** [libdbus-testutils.la] Error 2

Line 405 is in fact an error trying to read the archive header(s), so it looks like this is indeed an incompatibility issue with the archive format.

Probably not nice to hear, but is it wise to rely only on an "in-memory archive merging facility" that you can't really hope to test on all platforms, without fallback to a traditional method that I presume uses external tools from the toolchain?

FWIW, is this in anyway related to ar's MRI parser? I used to have a patch to avoid attemps using that when using llvm-ar .

And a closing remark: building pkgconf with LTO and --enable-static completes just fine...

Bingo, something indeed goes wrong trying to get the archive meta data(?). This is what I see when I insert a return in from of that last slbt_merge_archives_fail call :

Thanks for the catch! That's very helpful.

Probably not nice to hear, but is it wise to rely only on an "in-memory archive merging facility" that you can't really hope to test on all platforms, without fallback to a traditional method that I presume uses external tools from the toolchain?

You're right, that's not nice to hear, but that's fine. Some context;

  • gnu ar's implementation of MRI scripts is buggy (changes order of members). Then again, not all systems provide gnu ar.

  • llvm-ar can merge archives (-qL), but again, not all systems come with llvm-ar, and adding llvm-ar as a dependency even on systems where it does exist is going to make bootstrapping tricky (llvm-ar, for instance, has a dependency on libxml2, which builds with libtool). Accordingly, resorting to llvm-ar is not something which can be considered an acceptable solution.

  • slibtool's ar parsing and the in-memory archive merging facility have been tested against all common archive flavors, namely: sysv, bsd, 32-bits, 64-bits, big endian, little endian.

  • slibtool's ar parsing is also needed for the implementation of -export-symbols and -export-symbols-regex.

  • indeed, we have not yet tested slibtool-ar against archives containing llvm bitcode, and that's a good point.

  • Regarding fallback options: I left the implementation of the MRI-based archive merging in the code base (under src/fallback), so that downstream users on systems with archives that cannot be supported by slibtool could use it. That would require a one-line patch in slbt_archive_import() --> check for failure, and resort to slbt_archive_import_mri() as needed.

  • since MRI scripts were found to be buggy, using slbt_archive_import_mri() is not going to be part of slibtool's official release.

  • The above slbt_archive_import_mri() function could be easily converted into slbt_archive_import_llvm() -- for now, though, I'd much rather understand what failed with the bitcode rather than resort to an external tool of a huge scale.

Line 405 is in fact an error trying to read the archive header(s), so it looks like this is indeed an incompatibility issue with the archive format.

Is that a thin archive by any chance? Can you paste the output of:

$ hexdump -C /path/to/archive.a | head

A quick look at slbt_merge_archives suggests that the call to slbt_merge_archives_fail at its very end could result in arctx == NULL without an error being returned!

I think you're misreading the code. slbt_merge_archives_fail() returns whatever ret value it was passed by the caller, which in our context is either SLBT_SYSTEM_ERROR(dctx,0) or SLBT_NESTED_ERROR(dctx), which both always return a negative (-1) result. That's common practice everywhere in slibotol where a static _fail() function is in use.

Reading again through the backtraces and looking at the relevant line numbers, it appears that:

  • the archives do get successfully merged (or else slbt_store_archive() would have not been invoked.

  • for whatever reason, dctx has an invalid value, which likely happens because the *arctx passed to it is invalid.

Since we have not seen something similar on our end, could you debug this using lldb, place breakpoints at slbt_merge_archives() and slbt_store_archive(), step through the execution, and see where things go wrong?

I think you're misreading the code. slbt_merge_archives_fail() returns whatever ret value it was passed by the caller, which in our context is either SLBT_SYSTEM_ERROR(dctx,0) or SLBT_NESTED_ERROR(dctx), which both always return a negative (-1) result. That's common practice everywhere in slibotol where a static _fail() function is in use.

Yes, but you're discarding the return value from slbt_merge_archives_fail and continuing execution in slbt_merge_archives ultimately letting that function return 0. But the _fail function will have freed arctx.

not all systems come with llvm-ar

No, but most systems come with only a bare minimum build toolchain installed. slibtool doesn't need to worry about whether a given build tool is available, only if it works with the most common ones that the user could want to use.

And evidently it (you) could chose not to support anything outside the GNU toolchain.

That would require a one-line patch in slbt_archive_import() --> check for failure, and resort to slbt_archive_import_mri() as needed.

Hmmm, just to the code, not to the build system?

could you debug this using lldb

I can try when I get back to my Mac. Idem for the thin archive thing - but if the archive is thin that is something the build system specified, not I. pkgconf builds OK but I can't affirm if the archive merge function is used there.

Did you notice that this is a bit of a cross-build, targetting 32-bit i386 rather than the native x86_64 code? I think that should make no difference for the archive, just its contents, but I guess there could easily be something in its headers that slibtool isn't expecting.

I'm attaching the build log of the same code with LLVM and very similar options on Linux. Of course llvm-ar has to comply with Linux standards when targetting Linux so the fact that "things work OK" there probably doesn't mean much...

dbus-build.log

Is that a thin archive by any chance? Can you paste the output of:

$ hexdump -C /path/to/archive.a | head

> hexdump -C dbus-mp9-work/dbus-1.14.6-i386//test/.libs/libdbus-testutils.a  | head
00000000  21 3c 61 72 63 68 3e 0a  23 31 2f 31 32 20 20 20  |!<arch>.#1/12   |
00000010  20 20 20 20 20 20 20 20  30 20 20 20 20 20 20 20  |        0       |
00000020  20 20 20 20 30 20 20 20  20 20 30 20 20 20 20 20  |    0     0     |
00000030  30 20 20 20 20 20 20 20  33 39 33 32 20 20 20 20  |0       3932    |
00000040  20 20 60 0a 5f 5f 2e 53  59 4d 44 45 46 00 00 00  |  `.__.SYMDEF...|
00000050  90 03 00 00 00 00 00 00  a0 0f 00 00 1e 00 00 00  |................|
00000060  d8 27 00 00 39 00 00 00  d8 27 00 00 50 00 00 00  |.'..9....'..P...|
00000070  d8 27 00 00 6a 00 00 00  d8 27 00 00 81 00 00 00  |.'..j....'......|
00000080  d8 27 00 00 97 00 00 00  d8 27 00 00 aa 00 00 00  |.'.......'......|
00000090  d8 27 00 00 c1 00 00 00  d8 27 00 00 dc 00 00 00  |.'.......'......|

For comparison, libpkgconf.a which doesn't give problems yields this:

> hexdump -C pkgconf-1.9.0/.libs/libpkgconf.a  | head
00000000  21 3c 61 72 63 68 3e 0a  23 31 2f 31 32 20 20 20  |!<arch>.#1/12   |
00000010  20 20 20 20 20 20 20 20  30 20 20 20 20 20 20 20  |        0       |
00000020  20 20 20 20 30 20 20 20  20 20 30 20 20 20 20 20  |    0     0     |
00000030  30 20 20 20 20 20 20 20  33 32 33 36 20 20 20 20  |0       3236    |
00000040  20 20 60 0a 5f 5f 2e 53  59 4d 44 45 46 00 00 00  |  `.__.SYMDEF...|
00000050  10 03 00 00 00 00 00 00  e8 0c 00 00 17 00 00 00  |................|
00000060  e8 0c 00 00 2a 00 00 00  e8 0c 00 00 48 00 00 00  |....*.......H...|
00000070  90 27 00 00 5e 00 00 00  90 27 00 00 71 00 00 00  |.'..^....'..q...|
00000080  90 27 00 00 87 00 00 00  90 27 00 00 9b 00 00 00  |.'.......'......|
00000090  e8 55 00 00 ba 00 00 00  e8 55 00 00 cf 00 00 00  |.U.......U......|

Yes, but you're discarding the return value from slbt_merge_archives_fail and continuing execution in slbt_merge_archives ultimately letting that function return 0. But the _fail function will have freed arctx.

Fixed in commit 4bc3453, thanks for the catch! In the future, a pointer to the exact line would be much appreciated:-) Apparently the oversight at the original time of writing the code carried over to my reviewing it after reading your original report ...

hexdump -C dbus-mp9-work/dbus-1.14.6-i386//test/.libs/libdbus-testutils.a | head
00000000 21 3c 61 72 63 68 3e 0a 23 31 2f 31 32 20 20 20 |!<arch>.#1/12 |
00000010 20 20 20 20 20 20 20 20 30 20 20 20 20 20 20 20 | 0 |
00000020 20 20 20 20 30 20 20 20 20 20 30 20 20 20 20 20 | 0 0 |
00000030 30 20 20 20 20 20 20 20 33 39 33 32 20 20 20 20 |0 3932 |
00000040 20 20 60 0a 5f 5f 2e 53 59 4d 44 45 46 00 00 00 | `.__.SYMDEF...|
00000050 90 03 00 00 00 00 00 00 a0 0f 00 00 1e 00 00 00 |................|
00000060 d8 27 00 00 39 00 00 00 d8 27 00 00 50 00 00 00 |.'..9....'..P...|
00000070 d8 27 00 00 6a 00 00 00 d8 27 00 00 81 00 00 00 |.'..j....'......|
00000080 d8 27 00 00 97 00 00 00 d8 27 00 00 aa 00 00 00 |.'.......'......|
00000090 d8 27 00 00 c1 00 00 00 d8 27 00 00 dc 00 00 00 |.'.......'......|
```

For comparison, libpkgconf.a which doesn't give problems yields this:
```

hexdump -C pkgconf-1.9.0/.libs/libpkgconf.a | head
00000000 21 3c 61 72 63 68 3e 0a 23 31 2f 31 32 20 20 20 |!<arch>.#1/12 |
00000010 20 20 20 20 20 20 20 20 30 20 20 20 20 20 20 20 | 0 |
00000020 20 20 20 20 30 20 20 20 20 20 30 20 20 20 20 20 | 0 0 |
00000030 30 20 20 20 20 20 20 20 33 32 33 36 20 20 20 20 |0 3236 |
00000040 20 20 60 0a 5f 5f 2e 53 59 4d 44 45 46 00 00 00 | `.__.SYMDEF...|
00000050 10 03 00 00 00 00 00 00 e8 0c 00 00 17 00 00 00 |................|
00000060 e8 0c 00 00 2a 00 00 00 e8 0c 00 00 48 00 00 00 |....*.......H...|
00000070 90 27 00 00 5e 00 00 00 90 27 00 00 71 00 00 00 |.'..^....'..q...|
00000080 90 27 00 00 87 00 00 00 90 27 00 00 9b 00 00 00 |.'.......'......|
00000090 e8 55 00 00 ba 00 00 00 e8 55 00 00 cf 00 00 00 |.U.......U......|
```

Thanks! We're getting closer I believe ... will right again in some twelve hours.

Actually, and since the error that was fixed in commit 4bc3453 did create a lot of spurious noise, (given the invalid pointer, etc.), please start the build from scratch. What I anticipate:

  • the segmentation fault will be gone (so we'll close this issue).
  • we'll see a proper report of the failure returned from slbt_get_archive_meta(), in which case a new issue should be opened.

On that last note, if the failure associated with slbt_get_archive_meta() is not ENOENT, then please attach the archive which slibtool failed to parse to the new issue. Thanks!

In the future, a pointer to the exact line would be much appreciated:-)

Sorry, yes. Somehow my own brain overlooked the fact that I'm using a very recent check-out (and was apparently too lazy to figure out if any of my patches moved the code in question to a different line)...

Going to see what lldb can teach me.

Actually, and since the error that was fixed in commit 4bc3453 did create a lot of spurious noise, (given the invalid pointer, etc.), please start the build from scratch. What I anticipate:

  • the segmentation fault will be gone (so we'll close this issue).
  • we'll see a proper report of the failure returned from slbt_get_archive_meta(), in which case a new issue should be opened.

Our posts crossed. I actually did that already, and attached the failing command and its output in my 1st reply to the OP...
So you're right, the SIGSEGV is/was gone, but the error output isn't really much more helpful beyond the indication of the line number.

(And we have here a good example of why it's not always productive to delay trace output as you apparently do ... the error message from slbt_get_archive_meta could have been printed and would have helped understanding why the crash occurred. ;) And I do hope this isn't a good example of me overlooking something instead ^^ )

Fair enough:-) Let's speed things up even further via the following:

  • start the build from scratch.
  • once it fails, run the following command from the top-level build directory and attach the output:

$ for f in $(find . -name '*.a'); do echo $f; slibtool-ar -Wcheck $f; done

You had to make me utter bash'ism, did you? :)

bash-4.3$  for f in $(find . -name '*.a'); do echo $f; slibtool-ar -Wcheck $f; done
./bus/.libs/libdbus-daemon-internal.a
./bus/.libs/liblaunch-helper-internal.a
./dbus/.libs/libdbus-1.a
./dbus/.libs/libdbus-internal.a
./test/.libs/libdbus-testutils.a

You had to make me utter bash'ism, did you? :)

bash-4.3$ for f in $(find . -name '*.a'); do echo $f; slibtool-ar -Wcheck $f; done ./bus/.libs/libdbus-daemon-internal.a ./bus/.libs/liblaunch-helper-internal.a ./dbus/.libs/libdbus-1.a ./dbus/.libs/libdbus-internal.a ./test/.libs/libdbus-testutils.a

Above suggests that all generated archive files could be parsed by slibtool-ar.

Closing this for now since the SIGSEGV problem has been resolved, discussion shall continue in the other issues:-)

Metadata Update from @midipix:
- Issue status updated to: Closed (was: Open)

2 months ago

Meanwhile, in the debugger, I've established that slbt_ar_read_decimal_64 apparently fails its backwards search of the end of the variable it tries to read from mark.

> (cd dbus-mp9-work/dbus-1.14.6-i386/test/ ;  lldb -- /opt/local/bin/rlibtool --tag=CC --mode=link ccache /opt/local/bin/clang-mp-9.0 -fno-strict-aliasing -fno-common -Wnested-externs -Wmissing-prototypes -Wstrict-prototypes -Wdeclaration-after-statement -Wimplicit-function-declaration -Wold-style-definition -Wall -Wextra -Wundef -Wwrite-strings -Wpointer-arith -Wmissing-declarations -Wredundant-decls -Wno-unused-parameter -Wno-missing-field-initializers -Wformat=2 -Wcast-align -Wformat-nonliteral -Wformat-security -Wsign-compare -Wstrict-aliasing -Wshadow -Winline -Wpacked -Wmissing-format-attribute -Wmissing-noreturn -Winit-self -Wmissing-include-dirs -Warray-bounds -Wreturn-type -Wswitch-enum -Wswitch-default -Wnull-dereference -Wdouble-promotion -Wchar-subscripts -Wfloat-equal -Wpointer-sign -Wno-deprecated-declarations -Wno-unused-label -Wno-error=unused-parameter -Wno-error=missing-field-initializers -Wno-error=deprecated-declarations -Wno-error=unused-label -O3 -g -flto=thin -march=native -arch i386 -L/opt/local/lib -Wl,-headerpad_max_install_names -O3 -g -flto=thin -Wl,-cache_path_lto,/Volumes/VMs/MPbuild/_Volumes_Debian_MP9_site-ports_devel_dbus/dbus/work/dbus-1.14.6/.lto_cache -march=native -arch i386 -o libdbus-testutils.la disable-crash-handling.lo test-utils.lo ../dbus/libdbus-1.la ../dbus/libdbus-internal.la)

I had to set a few breakpoints with trace output commands to figure that,
going in, I have

(lldb)  p *arhdr
(ar_raw_file_header) $82 = {
  ar_file_id = {
    [0] = '1'
    [1] = '/'
    [2] = '2'
    [3] = '8'
    [4] = ' '
    [5] = ' '
    [6] = ' '
    [7] = ' '
    [8] = ' '
    [9] = ' '
    [10] = ' '
    [11] = ' '
    [12] = ' '
    [13] = ' '
    [14] = ' '
    [15] = '0'
  }
  ar_time_date_stamp = {
    [0] = ' '
    [1] = ' '
    [2] = ' '
    [3] = ' '
    [4] = ' '
    [5] = ' '
    [6] = ' '
    [7] = ' '
    [8] = ' '
    [9] = ' '
    [10] = ' '
    [11] = '0'
  }
  ar_uid = {
    [0] = ' '
    [1] = ' '
    [2] = ' '
    [3] = ' '
    [4] = ' '
    [5] = '0'
  }
  ar_gid = {
    [0] = ' '
    [1] = ' '
    [2] = ' '
    [3] = ' '
    [4] = ' '
    [5] = '6'
  }
  ar_file_mode = {
    [0] = '4'
    [1] = '4'
    [2] = ' '
    [3] = ' '
    [4] = ' '
    [5] = ' '
    [6] = ' '
    [7] = '6'
  }
  ar_file_size = {
    [0] = '1'
    [1] = '4'
    [2] = '0'
    [3] = ' '
    [4] = ' '
    [5] = ' '
    [6] = ' '
    [7] = ' '
    [8] = ' '
    [9] = '`'
  }
  ar_end_tag = {
    [0] = '\n'
    [1] = 'd'
  }
}

Then, slbt_ar_read_decimal_64 triggers its eject at line 96 with

(lldb)  frame variable
(const char *) mark = 0x0000000101808438 "140      `\ndisable-crash-handling.o"
(int) len = 10
(uint64_t *) dec = 0x00007fff5fbf98e8
(int) i = 3
(uint64_t) res = 140

Process 11764 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 8.1
    frame #0: 0x0000000100006a58 rlibtool`slbt_ar_read_decimal_64(mark="140      `\ndisable-crash-handling.o", len=10, dec=0x00007fff5fbf98e8) at slbt_archive_meta.c:96
   93                           res *= 10;
   94                           res += (mark[i] - '0');
   95                   } else {
-> 96                           return -1;
   97                   }
   98           }
   99  

Errno does appear to be 2 (ENOENT) but I figure it could be useful anyway to attach the offending archive.

Except that haven't gotten beyond determining which archives are being merged in slbt_merge_archives (so not which is the offending one):

(lldb) p *arctxv[0]->path
(const char *const) $219 = 0x00007fff5fbfbe20 ".libs/libdbus-testutils.a"
(lldb) p *arctxv[1]->path
(const char *const) $220 = 0x0000000100804522 "../dbus/.libs/libdbus-1.a"

libdbus-testutils.a

Metadata Update from @RJVB:
- Issue status updated to: Open (was: Closed)

2 months ago

Had to bzip libdbus-1.a :

libdbus-1.a.bz2

(PS: is there a way to disable the smiley popup that appears whenever I type a colon after a space?!)

Closing this for now since the SIGSEGV problem has been resolved, discussion shall continue in the other issues:-)

Sorry, too late, didn't see you were posting while I was jumping back and forth between typing my debugging report and doing the actual debugging. The SEGV was just a side-effect of what actually fails ;)

Output from

> find dbus-mp9-work/dbus-1.14.6-i386/ -name "*.a" | xargs s
libtool-ar -Wverbose -Wprint >> /tmp/dbus-archives.txt

dbus-archives.txt

And wondering, why do your slbt_ar_read_* functions scan the buffer backwards trying to find the actual textual length of the value they want to read? Apparently you (can) assume that the textual representation always starts at mark[0], so why not simply read digits until you encounter something else, and only return an error if a) no digits were found or b) that "something else" != AR_DEC_PADDING? You could probably also simply do sscanf(mark, "%d ", &dec); (should return 1 on success).

(PS: is there a way to disable the smiley popup that appears whenever I type a colon after a space?!)

Through hardware, possibly :european_post_office:

(lldb) p *arhdr
(ar_raw_file_header) $82 = {
ar_file_id = {
[0] = '1'
[1] = '/'
[2] = '2'
[3] = '8'
[4] = ' '
[5] = ' '
[6] = ' '
[7] = ' '
[8] = ' '
[9] = ' '
[10] = ' '
[11] = ' '
[12] = ' '
[13] = ' '
[14] = ' '
[15] = '0'
}

Looks like a potential culprit: 1/28 with a trailing ar_file_id[15] = 0, which differs from both sysv and bsd4.4

Can you upload the corresponding .a file? The files you've uploaded do not exhibit the above and are successfully parsed by slibtool-ar, at least on m end.

Can you upload the corresponding .a file? The files you've uploaded do not exhibit the above and are successfully parsed by slibtool-ar, at least on m end.

No idea which file that is. For some reason this issue also doesn't occur with slibtool-ar on my end, and I still haven't figured out how to obtain the name of the file that slbt_get_archive_meta works on. It has to be either the 2 I attached, or the result of their merging (but which is that??)

I've been exploring my idea of parsing ar_file_size differently:

static int slbt_ar_read_decimal_64(const char * mark, int len, uint64_t * dec)
{
    int       i;
    uint64_t  res;

    for (i=0,res=0; i<len; i++) {
        if ((mark[i] >= '0') && (mark[i] <= '9')) {
            res *= 10;
            res += (mark[i] - '0');
        } else {
            break;
        }
    }

//  fprintf(stderr, "mark=<%s>[%d], i=%d, mark[i]='%c' res=%lu\n", mark, len, i, mark[i], res);
    if (mark[i] != AR_DEC_PADDING) {
        return -1;
    }

    *dec = res;

    return 0;
}

With that, slibtool-ar -Wverbose -Wprint produces the same output and my mystery archive passes the 1st test. It now bails at (the new) line 520 because *ch++ != AR_DEC_PADDING. That one is going to be a little less easy for me to figure out...

Yep, it's indeed libdbus-testutils.a. The offending ar_file_id:

(lldb) p arhdr->ar_file_id
(char [16]) $3 = "1/28           0           0     0     644     6140      `\ndisable-crash-handling.o"

and

> otool  -a dbus-mp9-work/dbus-1.14.6-i386//test/.libs/libdbus-testutils.a  
Archive : dbus-mp9-work/dbus-1.14.6-i386//test/.libs/libdbus-testutils.a
00   0/0     860 0 #1/12
0644   0/0    6140 0 #1/28
0644   0/0   52716 0 #1/12
> otool -v -a dbus-mp9-work/dbus-1.14.6-i386//test/.libs/libdbus-testutils.a
Archive : dbus-mp9-work/dbus-1.14.6-i386//test/.libs/libdbus-testutils.a
----------  0/0     860 Thu Jan  1 01:00:00 1970 __.SYMDEF
-rw-r--r--  0/0    6140 Thu Jan  1 01:00:00 1970 disable-crash-handling.o
-rw-r--r--  0/0   52716 Thu Jan  1 01:00:00 1970 test-utils.o

With my modified algorithm I'm getting:

> slibtool-ar -Wverbose -Wprint dbus-mp9-work/dbus-1.14.6-i386//test/.libs/libdbus-testutils.a
Archive:
  - Meta:
    - [ name: dbus-mp9-work/dbus-1.14.6-i386//test/.libs/libdbus-testutils.a ]

  - Members:
    - Member:
      - [ name:       disable-crash-handling.o ]
      - [ timestamp:  1970/01/01 @ 01:00 ]
      - [ filesize:   6112 ]
      - [ uid:        0 ]
      - [ gid:        0 ]
      - [ mode:       420 ]

    - Member:
      - [ name:       test-utils.o ]
      - [ timestamp:  1970/01/01 @ 01:00 ]
      - [ filesize:   52704 ]
      - [ uid:        0 ]
      - [ gid:        0 ]
      - [ mode:       420 ]

Does that correspond to what you're seeing?

00 0/0 860 0 #1/12
0644 0/0 6140 0 #1/28
0644 0/0 52716 0 #1/12

sigh lldb seems to be messing up the output, possibly by trying to be smart about #1.

The above output from your otool command shows that the binary file is actually fine, with #1/28 as the file id. And indeed, if this hadn't been the case then slibtool-ar -Wcheck would have failed.

Moving forward: in the lldb command-line, where you previously had p *arhdr, please try p /x *arhdr instead and we'll take things from there ...

Actually, I have to belay my previous message. It's not the archive I thought. It turns out that slbt_get_archive_meta is called THREE times, once on libdbus-1.a, once on libdbus-testutils.a, and once on a file I cannot seem to find. That last time, I have

(lldb)  p ((char*)archive->map_addr)
(char *) $121 = 0x000000010120e000 "!<arch>\n#1/12                       0     0     0       33731     `\n__.SYMDEF"

and I don't see that size (33731) in the otool output for any of the archives in my build directory. As far as I can see, slibtool also doesn't have any other archives open.

Is that mystery archive the in-memory one I've seen referenced?

please try p /x *arhdr instead and we'll take things from there ...

That looks like a gdb'ism that my version of lldb doesn't support. What is it supposed to do?

FWIW, the first file of that mystery archive does show correctly, no trying to be clever here:

(lldb)  p *arhdr
(ar_raw_file_header) $122 = {
  ar_file_id = {
    [0] = '#'
    [1] = '1'
    [2] = '/'
    [3] = '1'
    [4] = '2'
    [5] = ' '
    [6] = ' '
    [7] = ' '
    [8] = ' '
    [9] = ' '
    [10] = ' '
    [11] = ' '
    [12] = ' '
    [13] = ' '
    [14] = ' '
    [15] = ' '
  }
<snip>

I've been wondering if the missing hash couldn't be the result of a previous non-standard/unexpected difference that leads the entire buffer to be left-shifted 1 char. I'm also seeing

  ar_end_tag = {
    [0] = '\n'
    [1] = 'd'
  }

in the offending arhdr struct. All the correct ones end with a

  ar_end_tag = {
    [0] = '`'
    [1] = '\n'
  }

Is that mystery archive the in-memory one I've seen referenced?

Not a mystery, just in-memory :crocodile: to avoid file system races, the merged archive first exists only in memory, and is then saved to disk via slbt_store_archive().

The last step of the merge operation is to parse the newly created archive, which is why you saw slbt_get_archive_meta() being called three times.

I'll try merging the two archives that you uploaded earlier (libdbus-testutils.a, libdbus-1.a) and hopefully will be seeing the exact same error that you had encountered:-)

I've been wondering if the missing hash couldn't be the result of a previous non-standard/unexpected difference that leads the entire buffer to be left-shifted 1 char. I'm also seeing
ar_end_tag = { [0] = '\n' [1] = 'd' }
in the offending arhdr struct. All the correct ones end with a
ar_end_tag = { [0] = '`' [1] = '\n' }

Awesome. I almost had the same thought but needed to have it come from someone else:-) To be continued ...

Some more fodder, the callers for each of the 3 slbt_get_archive_meta invocations:

    frame #0: 0x000000010000456a rlibtool`slbt_get_archive_meta(dctx=0x0000000100802c88, archive=0x0000000100103d60, meta=0x0000000100103d70) at slbt_archive_meta.c:353
   350          memset(hdrinfobuf,0,sizeof(hdrinfobuf));
   351 
   352          mark = archive->map_addr;
-> 353          cap  = &mark[archive->map_size];
   354 
   355          /* preliminary validation */
   356          if (archive->map_size < sizeof(struct ar_raw_signature))
(lldb) up
frame #1: 0x0000000100001b11 rlibtool`slbt_get_archive_ctx(dctx=0x0000000100802c88, path=".libs/libdbus-testutils.a", pctx=0x00007fff5fbfbd00) at slbt_archive_ctx.c:83
   80                   return slbt_free_archive_ctx_impl(ctx,
   81                           SLBT_NESTED_ERROR(dctx));
   82  
-> 83           if (slbt_get_archive_meta(dctx,&ctx->map,&ctx->meta))
   84                   return slbt_free_archive_ctx_impl(ctx,
   85                           SLBT_NESTED_ERROR(dctx));
   86  
    frame #0: 0x000000010000456a rlibtool`slbt_get_archive_meta(dctx=0x0000000100802c88, archive=0x0000000100104290, meta=0x00000001001042a0) at slbt_archive_meta.c:353
   350          memset(hdrinfobuf,0,sizeof(hdrinfobuf));
   351 
   352          mark = archive->map_addr;
-> 353          cap  = &mark[archive->map_size];
   354 
   355          /* preliminary validation */
   356          if (archive->map_size < sizeof(struct ar_raw_signature))
(lldb) up
frame #1: 0x0000000100001b11 rlibtool`slbt_get_archive_ctx(dctx=0x0000000100802c88, path="../dbus/.libs/libdbus-1.a", pctx=0x00007fff5fbfbd08) at slbt_archive_ctx.c:83
   80                   return slbt_free_archive_ctx_impl(ctx,
   81                           SLBT_NESTED_ERROR(dctx));
   82  
-> 83           if (slbt_get_archive_meta(dctx,&ctx->map,&ctx->meta))
   84                   return slbt_free_archive_ctx_impl(ctx,
   85                           SLBT_NESTED_ERROR(dctx));
   86  
    frame #0: 0x000000010000456a rlibtool`slbt_get_archive_meta(dctx=0x0000000100802c88, archive=0x00000001001047e0, meta=0x00000001001047f0) at slbt_archive_meta.c:353
   350          memset(hdrinfobuf,0,sizeof(hdrinfobuf));
   351 
   352          mark = archive->map_addr;
-> 353          cap  = &mark[archive->map_size];
   354 
   355          /* preliminary validation */
   356          if (archive->map_size < sizeof(struct ar_raw_signature))
(lldb) up
frame #1: 0x0000000100003fe8 rlibtool`slbt_merge_archives(arctxv=0x00007fff5fbfbd00, arctxm=0x00007fff5fbfbcc8) at slbt_archive_merge.c:680
   677          struct slbt_archive_ctx_impl * ictx;
   678          ictx = slbt_get_archive_ictx(arctx);
   679 
-> 680          if (slbt_get_archive_meta(dctx,arctx->map,&ictx->meta) < 0)
   681                  return slbt_merge_archives_fail(
   682                          arctx,0,0,
   683                          SLBT_NESTED_ERROR(dctx));

So yeah, I guess slibtool is failing on its own in-memory archive?!

We should stop crossing posts like they were rapiers ^^

Anyway, my cats have been making their presence known for a while now; time for me to call it a day as far as this is concerned!

We should stop crossing posts like they were rapiers ^^

Anyway, my cats have been making their presence known for a while now; time for me to call it a day as far as this is concerned!

cat(1) one should always be made a priority :=)

Fixed in commit 32c2302. Please confirm!

And wondering, why do your slbt_ar_read_* functions scan the buffer backwards trying to find the actual textual length of the value they want to read? Apparently you (can) assume that the textual representation always starts at mark[0], so why not simply read digits until you encounter something else, and only return an error if a) no digits were found or b) that "something else" != AR_DEC_PADDING? You could probably also simply do sscanf(mark, "%d ", &dec); (should return 1 on success).

You must scan the entire byte array, and so it makes sense to test only for space-padding end-to-beginning, and then digits only beginning-to-detemrined-end. Think of:
- "123 123"
- "01234 junk"
- etc.

You must scan the entire byte array, and so it makes sense to test only for space-padding end-to-beginning, and then digits only beginning-to-detemrined-end.

I still don't see it, and I know that'll keep turning it my head if I don't react to this now...

The strings (byte arrays) I've seen contain either a number followed by whitespace and then some other content, or start with whitespace. I don't think I've seen any string that has only whitespace after the number we want to read, and your examples don't either. Sticking with those examples, if the _read function receives len==strlen(mark) on entry I don't see how the scan-from-end loop would decrease len. Nor how these examples would NOT cause an error to be returned, or why you'd want to scan the entire string. Because if so, what value should be returned for "123 123"?

If valid byte arrays always the number followed by spaces and some other content, my implementation should always return that number with less overhead. But maybe you can give a real-world example where it would go wrong?

The strings (byte arrays) I've seen contain either a number followed by whitespace and then some other content, or start with whitespace. I don't think I've seen any string that has only whitespace after the number we want to read, and your examples don't either. Sticking with those examples, if the _read function receives len==strlen(mark) on entry I don't see how the scan-from-end loop would decrease len. Nor how these examples would NOT cause an error to be returned, or why you'd want to scan the entire string. Because if so, what value should be returned for "123 123"?

Keep in mind that the purpose of the function is not just to "extract" the number from the byte array, but also to validate the byte array in its entirety. Accordingly, "123 123" should fail; and anything not having only 0x20 (space) after the last digit should fail.

And that's why I said that you'd have to scan the entire byte array anyway, which means two loops: one that scans all the digits, and another one that verifies that everything following the last digit is just the expected space padding. I found it more elegant to implement the scanning starting at the end, but this could of course be implemented the other way around. The most important thing is to validate all bytes, not just the first 0x20 following the last digit.

Keep in mind that the purpose of the function is not just to "extract" the number from the byte array, but also to validate the byte array in its entirety.

Ah! See, I had no way of knowing that (unless I missed a comment in or above another _read function) ...

Reproduced.

Out of curiosity, how? On a Mac? I'm not on mine until later today but have access to the files so tried to reproduce the issue on Linux by executing the same command using clang's --target cross-building feature, and didn't get any errors.

Fix confirmed but see #57.

Metadata Update from @RJVB:
- Issue status updated to: Closed (was: Open)

2 months ago

Reproduced.

Out of curiosity, how? On a Mac? I'm not on mine until later today but have access to the files so tried to reproduce the issue on Linux by executing the same command using clang's --target cross-building feature, and didn't get any errors.

At the core, the problem was not osx-specific. The bug would only manifest itself when size of the string table was odd, and that did happen only on osx (I guess on other platforms ar(1) automatically aligns the string table at a 2-byte boundary).

The slibtool-ar code did take alignment into account everywhere else, and not aligning the size of the string table was nothing but an oversight. Glad we ran into this!

Login to comment on this ticket.

Metadata
Attachments 5
Attached 2 months ago View Comment
Attached 2 months ago View Comment
Attached 2 months ago View Comment
Attached 2 months ago View Comment
Attached 2 months ago View Comment