Ignore DOCTYPE elements during XSD parse

David Schloss


I am trying to find a way to ignore DOCTYPE elements when parsing XSD files
with the loadGrammar function. I only want to ignore DOCTYPE elements while
parsing, not <include> elements, which may reference other nested schema
files that I want as part of the overall grammar. Any of these nested schemas
could also contain DOCTYPE elements, and those should be ignored as well. My
last requirement is that I use a DOM parser.


Below is a subset of some things I’ve already tried and had no success with:


* Set load external DTD to false and Validation to Never prior to the
loadGrammar call.
* Set Skip DTD validation to true and Validation to Always prior to the
loadGrammar call.
* Set create entity reference nodes to false. This disregards all DOCTYPEs
and include elements, and the grammar is not built properly.
* Created a custom entity resolver that returns NULL when it encounters
elements that match DTD files, and set the parser to disable the default
entity resolution process during the parse.
* Attempted all of the above against both DOM parsers
(DOMLSParser and XercesDOMParser).


My test is relatively simple.


1. Create parser
2. Configure parser
3. Pre-load parent schema into grammar pool
4. Configure parser to validate always
5. Parse an XML file that validates against the schema, with no errors
thrown during the parse operation


The offending file in my specific case is xml.xsd
(https://www.w3.org/2004/10/xml.xsd), provided by W3C. That file references
XMLSchema.dtd in a DOCTYPE on its 2nd line. Since this defines the element
attributes of standard XML, if you resolve it to NULL in a custom entity
resolver, the Xerces parser is not able to understand basic XML element
attributes like “xmlns”. Curiously, if you comment out the aforementioned
DOCTYPE line in the schema, the Xerces parser is able to fully parse the
grammar without issue.
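
To illustrate the observation above without touching the schema files on
disk, here is a minimal sketch that removes the DOCTYPE from an in-memory
copy of a schema. The helper name stripDoctype is hypothetical and is not
part of Xerces-C++; it uses only the C++ standard library. It is an
assumption, not something tested here, that the stripped text could then be
handed to the parser through an in-memory input source such as
MemBufInputSource, and that nested schemas pulled in via <include> would need
the same treatment in an entity resolver. The scan does not account for
comments or processing instructions inside an internal subset.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Xerces-C++ API): returns a copy of `xml`
// with the first <!DOCTYPE ...> declaration removed. If no complete
// declaration is found, the input is returned unchanged.
std::string stripDoctype(const std::string& xml) {
    const std::string marker = "<!DOCTYPE";
    const std::size_t start = xml.find(marker);
    if (start == std::string::npos) {
        return xml;  // nothing to strip
    }

    int bracketDepth = 0;  // > 0 while inside an internal subset [ ... ]
    char quote = '\0';     // active quote character, or '\0' if none

    for (std::size_t i = start + marker.size(); i < xml.size(); ++i) {
        const char c = xml[i];
        if (quote != '\0') {
            // Inside a quoted literal: only the matching quote ends it.
            if (c == quote) quote = '\0';
        } else if (c == '"' || c == '\'') {
            quote = c;
        } else if (c == '[') {
            ++bracketDepth;
        } else if (c == ']') {
            if (bracketDepth > 0) --bracketDepth;
        } else if (c == '>' && bracketDepth == 0) {
            // End of the declaration: splice it out.
            return xml.substr(0, start) + xml.substr(i + 1);
        }
    }
    return xml;  // unterminated declaration: leave the text untouched
}
```

This only reproduces the effect of commenting out the DOCTYPE line by hand;
it does not answer whether the parser itself can be configured to skip it.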


Am I being naïve about how the Xerces parsers are designed to work? Is there
an obvious solution to this situation that does not involve editing the
schema files to remove the DOCTYPE elements?


Thank you very much for your time,

Dave Schloss