Differences in parse forests (gold vs obtained with supertags identical to gold lexical types)

(I am not sure what the problem is, so I am not sure which category fits best.)

I am experimenting with a new supertagger for ACE+ERG. The tag sequences it produces look accurate to me, but FFTB does not consider the resulting parses similar enough to the gold parses.

I need to understand what is going on and what to do about it, but there are so many pieces involved that I end up confused.

Suppose we have a sentence (from the pest dataset): Not all those who wrote oppose the changes.

The supertagger assigns a lexical type to each terminal. I’ve extracted the gold lexical types using pydelphin (second column), and I supply the tags listed in the third column:

not       av_-_dg-det_le      av_-_dg-det_le
all       n_np_mc-a_le        n_np_mc-a_le
those     n_-_pr-dei-pl_le    n_-_pr-dei-pl_le
who       n_-_pr-rel-who_le   n_-_pr-rel-who_le
wrote     v_np*_le            v_np*_le
oppose    v_np_le             v_np_le
the       d_-_the_le          d_-_the_le
changes   n_pp_mc-of_le       n_pp_mc-of_le
.         pt_-_period_le      pt_-_period_le
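
To rule out invisible differences between the two columns (stray whitespace, look-alike Unicode characters), a strict character-level comparison might help. This is only a minimal sketch in plain Python: the rows are pasted by hand from the table above, and `describe_mismatches` is a made-up helper name, not part of pydelphin or ACE. In practice the rows would have to be read from the actual gold and supertagger output files, since pasting can hide exactly the differences being looked for.

```python
import unicodedata

# Rows pasted from the table above: (token, gold lexical type, supplied tag).
ROWS = [
    ("not", "av_-_dg-det_le", "av_-_dg-det_le"),
    ("all", "n_np_mc-a_le", "n_np_mc-a_le"),
    ("those", "n_-_pr-dei-pl_le", "n_-_pr-dei-pl_le"),
    ("who", "n_-_pr-rel-who_le", "n_-_pr-rel-who_le"),
    ("wrote", "v_np*_le", "v_np*_le"),
    ("oppose", "v_np_le", "v_np_le"),
    ("the", "d_-_the_le", "d_-_the_le"),
    ("changes", "n_pp_mc-of_le", "n_pp_mc-of_le"),
    (".", "pt_-_period_le", "pt_-_period_le"),
]


def describe_mismatches(rows):
    """Return one note per row whose gold and supplied tags are not equal.

    Hypothetical helper: it also guesses why two tags that look the same
    on screen might still differ.
    """
    notes = []
    for token, gold, supplied in rows:
        if gold == supplied:
            continue  # exact match, nothing to report
        if gold.strip() == supplied.strip():
            reason = "differs only in surrounding whitespace"
        elif (unicodedata.normalize("NFKC", gold)
              == unicodedata.normalize("NFKC", supplied)):
            reason = "differs only in Unicode form"
        else:
            reason = "different tag"
        notes.append(f"{token}: {gold!r} vs {supplied!r} ({reason})")
    return notes


# Print every mismatch, or a single line if the columns agree exactly.
for note in describe_mismatches(ROWS) or ["all tags match exactly"]:
    print(note)
```

If a check like this reports no mismatches on the real files, the tags themselves are probably not where the difference comes from.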

As far as I can tell, the two tag sequences are identical. I do get parses, but when I compare the treebanks with FFTB, I see:

How should I approach this? I suppose I should look at the profile files themselves and compare them directly. Which specific files or fields should I compare?