HACKING LibreDWG

"???" means that there is still some question to resolve;
someone should probably think about and resolve them!

* regen

  To regenerate the configure script and Makefile.in files,
  change to the top-level directory and do: "sh autogen.sh".
  Make sure you have recent Autoconf, Automake, Libtool installed.
  (See comments in autogen.sh for specific versions tested.)

* other tools

  Aside from the the autotools (see "regen", above),
  you will also need GNU Texinfo to build the manual.
  See e.g. the github action for recipes, or .appveyor.yml
  for another Mingw recipe.
  The compiler needs to support C99, i.e. MSVC cannot be used.

* coding standards

  We try to follow the GNU Coding Standards, with some exceptions.
  - from Emacs: (info "(standards)")
  - from shell: $ info standards

  We format the code now with clang-format. For include/*.h we would
  require >= 8.0 for the new StatementMacros option, but we don't process
  include yet, because src/gen-dynapi.pl is too fragile.
  You can use

    $ bash build-aux/clang-format-all.sh src programs examples test

  or use some clang-format editor integration.
  See https://clang.llvm.org/docs/ClangFormat.html

  The exceptions are:
  - [[change log maintenance]]

* version numbers

  We follow semantic versioning, using git-version-gen with no v prefix.
  e.g. 0.5, 0.5.0.1099, 0.5.0.1093.1_967f

  Major for breaking changes in the API of otherwise declared stable entities.

  Minor for adding features, bugfixes and minor changes.
  Releases mostly only carry those two, and maybe the patch number.

  Patch for backwards-compatible bug fixes. It's optional and left-out if 0.

  The build number is automatically incremented for smoke and master builds
  and creates a volatile tag.
  It's left out in releases, only serves as volatile tag for nightly master
  builds on github.

  The optional number after those 4 numbers is the number of commits not yet
  merged to master. E.g. the 1 in 0.5.0.1093.1_967f.
  And finally for local development builds the abbrevated git tag, e.g. the 967f
  in 0.5.0.1093.1_967f.

  So a released version will be 2 or 3 numbers, a development version will also
  carry the 4-digit build number, and if it's a branch also two more elements.

  Version numbers are generated manually for a release by pushing a 2-3 number tag, 
  and automatically by bumping the version in .appveyor.yml.

* make regen-dynapi

  Whenever you change a class or any class field (add, remove, rename, change a type),
  add a separate make regen-dynapi commit, as result of this command. So the small change
  is separated from the automatically created changes by gen-dynapi.pl.

  This command updates several generated lists of objects, classes, and its tests.

* change log maintenance

  Presently, there is only one top-level ChangeLog file, and commits go
  in without updating it.  For releases, we generate
  ChangeLog entries based on the commit logs.
  We use the script create-changelog (in build-aux/) for that.

  This means that then the commit logs should follow
  the GNU Coding Standards.  For example:

  | Add foo, with increased bar.
  |
  | Normally, we don't need to foo, but sometimes it is necessary.
  | In those cases, we might as well use a bigger bar.
  |
  | * src/part.h (foo): New decl.
  |   (bar): Bump value of this #define to 42.
  | * src/part.c (foo): New func.
  | * src/main.c (main): In the case of `sometimes', call `foo'.
  | * test/special.test (normally): Don't test `sometimes'.
  |   (sometimes): New test case.
  | * doc/whole.texi (Special Cases): Document `sometimes' handling.

  This example has three parts: a one-line sentence describing the change,
  followed by two newlines, followed by a short discussion of the change,
  followed by entries for each of the five changed files.  A template:

  | ONE-LINE SENTENCE
  |
  | DISCUSSION
  |
  | * CHANGES-TO-FILE
  | [...]

  For small changes or when the one-line sentence suffices, the discussion
  (and its following two newlines) can be dropped:

  | ONE-LINE SENTENCE
  |
  | * CHANGES-TO-FILE
  | [...]

  There are some conventions for the one-line sentence:

  - Suffix "; nfc." means no functional change (e.g., changing comments only).
    This causes create-changelog to omit the entry from its output.

  - Prefix "TOPIC:" means this change is about some TOPIC.
    Some topics we use are:
    - admin  -- administrative stuff (e.g., this file)
    - build  -- configuration, makefiles, etc
    - decode -- read path (decoding)
    - encode -- write path (encoding)
    - dxf    -- dxf writer
    - indxf  -- dxf reader
    - binding -- language bindings
    - api    -- user API
    - doc    -- documentation

* trailing whitespace

  Don't be uncool; avoid introducing trailing whitespace!  See:
  <http://old.nabble.com/Re:-whitespace-cleanup-p6850253.html>

* branch names

  If you want to push a branch that may be "git rebase"d in the future,
  either use the prefix "wip-" (work in progress), or your Savannah
  username followed by a slash (e.g., "juca/").
  There are also "work/*" and "smoke/*" branches.

* make release

  This might be better in an in-tree ./configure, not in an extra build
  directory. But out-of-tree is also supported now.
  It needs the default configure options, esp. the enabled bindings.

  Before a release:
  - update NEWS, .appveyor.yml, libredwg.spec manually
  - generate the missing ChangeLog entries
    e.g. via build-aux/gitlog-to-changelog --since='2018-11-05' >x
  - make distcheck
  - push a smoke/ branch to check the CI results for linux, darwin, freebsd, mingw
    and cygwin.
  - create a temp. tag with the correct version number (see above):
    e.g. git tag -s -m 'release 0.6.2' 0.6.2
  - sh autogen.sh to update the version
  - make regen-man to update the manpages

  - update/create the release commit and sign it with -S
    e.g. git commit -S --amend -a -m 'Release 0.6.2
    see NEWS'
  - merge it into master (ff)
  - update the tag: git tag -d 0.6.2; git tag -s -m 'release 0.6.2' 0.6.2
  - make dist to create the source tarballs
  - push master and tags to run the CI and create the windows binaries on appveyor
  - upload the dist tarballs
    build-aux/gnupload --to ftp.gnu.org:libredwg libredwg-0.6.2.tar.gz libredwg-0.6.2.tar.xz
  - download the appveyor artifacts and sha256sum and sign it
    gpg -b -a libredwg-0.6.2-win32.zip; mv libredwg-0.6.2-win32.asc libredwg-0.6.2-win32.sig
    sha256sum libredwg-0.6.2*
  - edit the github release, copy from the previous and fix up the text with the sha256sum's,
    upload the dists and sigs to this page.
  - regen the docs, the refman and manual
    make manual refman
  - update the libredwg-cvs checkout for the docs and GNU homepage with the updated docs via
    make release-web
  - create the announcement via build-aux/announce-gen (needs lots of args)
    and fixup the header with the NEWS
  - create a savannah news item with the announcement, and post it to the announcement
    mailinglist and twitter. maybe also to reddit.com/r/cad and similar forums.

* using gdb with programs in examples/

  The programs in examples are built by libtool and dynamically linked
  against the pre-installed library by using a wrapper script.  To run
  them under gdb, use:

  $ libtool --mode=execute gdb PROGRAM

  But it is easier to pass --disable-shared to configure and call
  gdb --args directly.

* mingw cross-compilation

  If you have 32-bit wine use the i686-w64-mingw32 target,
  add CFLAGS="-gdwarf-2" for debugging with winedbg, best with --disable-shared.
  Copy some required mingw dll's into your programs dir.

  Recommended for debugging:

  $ ./configure --enable-trace --enable-write --host=i686-w64-mingw32
  $ make CFLAGS="-gdwarf-2"

  Sample session in programs:

  $ make -C .. CFLAGS="-gstabs" && \
    cp ../src/.libs/libredwg-0.dll . && \
    LIBREDWG_TRACE=4  winedbg .libs/dwgread.exe ../test/test-data/2000/Leader.dwg
  > b dwg_decode_eed
  > cont

* python on macports

  On macports with system python overriding the macports python2.7 you'd might need to set
  either:

  $ export PYTHONPATH=/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/

  or run the tests with:

  $ make check PYTHON=/opt/local/bin/python2.7

  because the system python is missing libxml2.

  Or add the macports libxml2 to the system python2.7:

  $ port install py27-libxml2
  $ sudo cp /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/libxml2* \
            /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/

* fuzzing with afl-fuzz

On darwin I need to set AFL_CC and CC.

  make clean
  CC="afl-clang" ./configure --enable-trace --enable-write --disable-shared
  make
  mkdir fuzz-in; cp test/test-data/example_2000.dwg fuzz-in/
  afl-fuzz -i fuzz-in -o fuzz-out -- programs/dwgread -

Using the fast option and an internal loop would be faster. I get 220/sec uninstrumented
and 800/sec instrumented without -O2, which is fast enough to finish within 30m for a 32k DWG.

Update: With honggfuzz:

    ../configure --disable-shared --disable-bindings CC=hfuzz-clang CFLAGS='-O2 -g -fsanitize=address,undefined -fno-omit-frame-pointer -I/usr/local/include'
    make -C src && make -C examples dwgfuzz
    honggfuzz -i ../.fuzz-in-dxf -- examples/dwgfuzz -indxf ___FILE___

I added a better examples/dwgfuzz for faster persisent mode and more coverage. Up to 2000/sec.
There's also a new examples/llvmfuzz which finds even more bugs.

    make -C src
    clang -I../src -Isrc -g -O3 -fsanitize=address,fuzzer ../examples/llvmfuzz.c -Lsrc/.libs -lredwg
    LD_LIBRARY_PATH=src/.libs ./a.out  -timeout=4000 -detect_leaks=0 -rss_limit_mb=8000 ../test/test-data/

* adding other code

You can only add significant code by some author who has copyright
assigned to the FSF or signed a copyright disclaimer with the FSF. See
CONTRIBUTING.

The license of this work (code, docs, ...) must be GPLv3 compatible,
see the list at USING_FOREIGN_CODE.

* reverse-engineering with examples/unknown

There's a lot of code related to examples/unknown to automatically
find the field layout of yet unknown classes. At first you need
DWG/DXF pairs of unknown entities or objects and put them into
test/test-data/. At creation take care to create uniquely identifiable
names and numbers, not to create DXF fields all with the same value 0.
Then you'll never known which field in the DWG is which.

Then run make -C examples regen-unknown, which does this:

run ./logs-all.sh to create -v4 logfiles with the binary blobs for all
UNKNOWN_OBJ and UNKNOWN_ENT instances in those DWG's.

Then the perl script log_unknown.pl creates the include file
alldwg.inc adding all those blobs.

The next perl script log_unknown_dxf.pl parses alldwg.inc and looks
for matching DXF files, and creates the 3 include files alldxf_0.inc
with the matching blob data from alldwg.inc, alldxf_1.inc with the
matching field types and values from the DXF and alldxf_2.inc to
workaround some static initialization issues in the C file.

Next run make unknown, which does this:

Compiles and runs examples/unknown, which creates for a every string
value in the DXF some bits representations and tries to find them in
the UNKNOWN blobs. If it doesn't find them, either the string-to-bit
conversion lost too much precision to be able to find them, esp. with
doubles, or we have a different problem. make unknown creates a big
log file unknown-`git describe`.log in which you can see the
individual statistics and initial layout guesses.

E.g.
42/230=18.3%
possible: [34433333344443333334444333333311xxxxxxxxxx3443333...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
11 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 11   1]

The x stands for a fixed field, the numbers and a dot for the number
of variants this bit is used for (the dot for >9) and a space means
this is a hole for a field which is not represented as DXF field, i.e.
a FIELD_*(name, 0) in the dwg.spec with DXF group code 0.

unknown also creates picat data files in examples/ which are then used with
picat from http://picat-lang.org to enhance the search for the best layout
guess for each particular class. picat is a nice mix of a functional
programming tool with an optional constraint solver. The first part in
the picat process does almost the same as unknown.c, finding the fixed
layout, possible variants and holes in a straight-forward functional
fashion. This language is very similar to erlang, untyped haskell or prolog.
The second optimization part of picat uses a solver with
constraints to improve the layout of the found variants and holes to
find the best guess for the needed dwg.spec layout.
Note that picat list and array indices are one-based, so you need to
subtract 1 from each found offset. 1-32 mean the bits 0-31.

The field names are filled in by examples/log_unknown_dxf.pl automatically.
We could parse dwg.spec for this, but for now I went with a manual solution,
as the number of unknown classes gets less, not more.

E.g. for ACAD_EVALUATION_GRAPH.pi with a high percentage from the above
possible layout, it currently produces this:

Definite result:
----------------
HOLE([1,32],01000000010100000001010000000110) len = 32
FIELD_BL (edge_flags, 93);	// 32 [33,42]
HOLE([43,52],0100000001) len = 10
FIELD_BL (node_edge1, 92);	// -1 [53,86]
FIELD_BL (node_edge2, 92);	// -1 [87,120]
FIELD_BL (node_edge3, 92);	// -1 [121,154]
FIELD_BL (node_edge4, 92);	// -1 [155,188]
HOLE([189,191],100) len = 3
FIELD_H (ownerhandle, 330);	// 6.0.0 [192,199]
FIELD_H (evalexpr, 360);	// 3.2.2E2 [200,223]
HOLE([224,230],1100111) len = 7
----------------
Todo: 32 + 178 = 210, Missing: 20
FIELD_BL (has_graph, 96);	// 1 0100000001 [[1,10],[11,20],[21,30],[43,52]]
FIELD_BL (unknown1, 97);	// 1 0100000001 [[1,10],[11,20],[21,30],[43,52]]
FIELD_BL (nodeid, 91);	// 0 10 [[2,3],[10,11],[12,13],[20,21],[22,23],[31,32],[44,45],[52,53],[189,190],[225,226]]
FIELD_BL (num_evalexpr, 95);	// 1 0100000001 [[1,10],[11,20],[21,30],[43,52]]

The next picat steps will automate the following reasoning:

The first hole 1-32 is filled by the 3 1 values from BL96, BL97 and
BL95, followed by the 0 value from BL91. The second hole is clearly
another unknown BL with value 1.  The third hole at 189-191 is padding
before the handle stream, and can be ignored.  This is from a r2010
file, which has separate handle and text streams.  The last hole
224-230 could theoretically hold almost another unknown handle, but
practically it's also just padding. The last handles are always
optional reactors and the xdicobject handle for objects, and 7 bits is
not enough for a handle value. A code 4 null-handle would be 01000000.

You start by finding the DXF documentation and the ObjectARX header
file of the class, to get the names and description of the class.

You add the names and types to dwg.h and dwg.spec, change the class
type in classes.inc to DEBUGGING or UNSTABLE. With DEBUGGING add the
-DDEBUG_CLASSES flag to CFLAGS in src/Makefile (e.g. by
--enable-debug) and test the dwg's with programs/dwgread -v4 (e.g. by ./log file).
Some layouts are version dependent, some need a REPEAT loop or vector
with a num_field field.

The picat constraints module examples/unknown.pi is still being worked
and is getting better and better identifying all missing classes
automatically. The problem with AutoCAD DWG's is that everybody can
add their own custom classes as ObjectARX application, and that
reverse-engineering them never stops. So it has to be automated somehow.

There are also two more helpers bd and bits in examples/, which decode a bit
pattern to the most likely value/type combination or all.

* Convert unknown_bits HEX to binary

Store the HEX string from the log into a file, like acds.hex.

    perl -ne'$_ =~ s/(..)/chr(hex($1))/ge; print' acds.hex >acds.dat

* etc
#+STARTUP: odd
  Local variables:
  mode: org
  End: