Re: Converting HTML to DocBook
>>>>> "j" == juro <[email protected]> writes:
j> Hello, I'd like to ask you if there's an easy way (a convertor)
j> to convert a html file to sgml. If yes, where can I get it.
If you really mean SGML, the answer is trivial: HTML _is_ an SGML
application, so the identity transformation will do the trick ;) I
suspect, however, that you meant to ask whether we can translate HTML
into even a crude form of DocBook DTD SGML.
I hope to be corrected, but the answer is apparently 'no'. Undaunted
by the lack of an answer on many mailing lists, I attempted to create
a converter as a one-evening project; my experience was enough to
demonstrate that the problem is non-trivial in the extreme.
For example, even if the HTML is _very_ well-behaved, we have to
regularly interpret constructs such as
<H1>the first section</H1>
<P>this is the first section</p>
<H2>and a subsection</h2>
<p>with some text</p>
<H1>the next section</h1>
and turn it into
<sect1><title>the first section</title>
<para>this is the first section</para>
<sect2><title>and a subsection</title>
<para>with some text</para>
</sect2>
</sect1>
<sect1> ...
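To make the regrouping concrete, here is a minimal sketch in Python
(not DSSSL, and not part of my stylesheet). It assumes very
well-behaved input containing only headings and paragraphs; the names
SectionNester and nest_sections are made up for this illustration.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


def heading_level(tag):
    """Return 1..6 for h1..h6, or None for any other tag."""
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return None


class SectionNester(HTMLParser):
    """Hypothetical helper: wrap flat headings in nested sectN elements."""

    def __init__(self):
        super().__init__()
        self.out = []          # output fragments, joined at the end
        self.open_levels = []  # stack of section levels still open

    def handle_starttag(self, tag, attrs):
        level = heading_level(tag)  # HTMLParser lowercases tag names
        if level is not None:
            # A new heading closes every open section at its own
            # depth or deeper before opening its own section.
            while self.open_levels and self.open_levels[-1] >= level:
                self.out.append("</sect%d>" % self.open_levels.pop())
            self.open_levels.append(level)
            self.out.append("<sect%d><title>" % level)
        elif tag == "p":
            self.out.append("<para>")

    def handle_endtag(self, tag):
        if heading_level(tag) is not None:
            self.out.append("</title>")
        elif tag == "p":
            self.out.append("</para>")

    def handle_data(self, data):
        # Whitespace between tags is dropped; inline markup is ignored.
        if data.strip():
            self.out.append(escape(data.strip()))


def nest_sections(html):
    """Return the DocBook-like markup for a flat HTML fragment."""
    parser = SectionNester()
    parser.feed(html)
    parser.close()
    # Close whatever sections are still open at end of input.
    while parser.open_levels:
        parser.out.append("</sect%d>" % parser.open_levels.pop())
    return "".join(parser.out)
```

The stack is the part a per-element stylesheet rule cannot easily
express: each heading has to know which earlier sections are still
open. Note that a skipped level (an H3 directly under an H1) would
come out as a sect3 inside a sect1, which is not valid DocBook, so
even this sketch is incomplete.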
There is no equivalent to the sectN tag in HTML, and the fundamental
differences only begin with this most elemental of elements.
Consider the logic of parsing the <A> tag, which must be different for
NAME and HREF types, or the nightmare of tables within tables. Even
the <HEAD> does not really map to <ARTHEADER>, and HTML is full of
non-containers which carry their information inside attributes rather
than within container tags (whew).
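As an illustration of the <A> problem, here is a rough Python sketch
of the branching involved. It is an assumption about one reasonable
mapping, not what my stylesheet does: NAME becomes a DocBook anchor,
a "#fragment" HREF becomes a link, and any other HREF becomes a
ulink. The function name convert_anchor is made up for the example.

```python
from xml.sax.saxutils import escape, quoteattr


def convert_anchor(attrs, text):
    """Map one HTML A element (attribute dict plus text) to DocBook.

    attrs uses lowercase keys, e.g. {"name": "intro"} or
    {"href": "http://www.teledyn.com"}.
    """
    parts = []
    name = attrs.get("name")
    href = attrs.get("href")

    if name:
        # anchor is an empty element in SGML DocBook, so no end tag.
        parts.append("<anchor id=%s>" % quoteattr(name))

    body = escape(text)
    if href is None:
        parts.append(body)
    elif href.startswith("#"):
        # Link within the same document: point at the target's ID.
        parts.append("<link linkend=%s>%s</link>"
                     % (quoteattr(href[1:]), body))
    else:
        # Anything else is treated as an external URL.
        parts.append("<ulink url=%s>%s</ulink>" % (quoteattr(href), body))
    return "".join(parts)
```

Even this glosses over real problems: an HTML NAME value is not
guaranteed to be a legal SGML ID, and a relative HREF into another
file of the same document is neither a clean link nor a clean ulink.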
OK, it helped me a little, so I am including my stylesheet with this
message --- be forewarned it is really sloppy because I was just toying
around with it; it is a composite of several postings on the DSSSL
list spliced together with no concern for consistent case or aesthetic
style. I estimate it saved me maybe 20% of the total time to
translate the kerneld HOWTO, and given that small gain, I really
wonder if it was worth it. Still, nothing is ever a complete failure:
it can always be used as a bad example.
<!--
HTML to DocBook transformation by Gary Lawrence Murphy
with all the ideas from many other people. It doesn't work
but it does save some of the grunt work in moving a well-behaved
HTML file to a format that is DocBook-like. Don't expect
miracles.
This stylesheet was simply spliced together from comments and
musings by various authors on the dssslist at mulberrytech.com.
I cannot rightly claim copyright and only include the license
below to clarify the free nature of this work.
The A tag support is broken horribly, as is ULINK, but the UL
support alone makes it somewhat useful.
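USE: jade -t sgml -d html2db.dsl HTMLFILE > SGMLFILE
(-t sgml selects Jade's SGML output backend, which the element
flow object needs)
you may need to remove any DTD (DOCTYPE) lines from the start of your HTML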
LICENCE
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
-->
<!DOCTYPE style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">
<style-sheet>
<style-specification id="html2db">
<style-specification-body>
(declare-flow-object-class element
  "UNREGISTERED::James Clark//Flow Object Class::element")

(define (copy-attributes #!optional (nd (current-node)))
  (let loop ((atts (named-node-list-names (attributes nd))))
    (if (null? atts)
        '()
        (let* ((name (car atts))
               (value (attribute-string name nd)))
          (if value
              (cons (list name value)
                    (loop (cdr atts)))
              (loop (cdr atts)))))))
;; HR has no DocBook counterpart, so drop it with its own rule.
;; (An empty sosofo cannot stand in for the gi: string computed below.)
(element HR (empty-sosofo))

(default
  (let* ((old-gi (gi (current-node)))
         (new-gi
          (case old-gi
            (("HTML") "article")
            (("HEAD") "artheader")
            (("BODY") "sect1")
            (("PRE") "screenshot")
            (("UL") "itemizedlist")
            (("I") "emphasis")
            (("STRONG") "emphasis")
            (("B") "emphasis")
            (("TT") "command")
            (("P") "para")
            (("MENU") "itemizedlist")
            (else old-gi))))
    (make element
      gi: new-gi
      attributes: (copy-attributes))))
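;; Known broken (see the header comment): when NAME or HREF is absent
;; the list below contains '() instead of a name/value pair, and the
;; output is still an A element rather than a DocBook one.
(element A
  (let ((attr (list
               (if (attribute-string "NAME")
                   (list "ID" (attribute-string "NAME"))
                   '())
               (if (attribute-string "HREF")
                   (list "ULINK" (attribute-string "HREF"))
                   '()))))
    (make element gi: "A"
          attributes: attr
          (process-children))))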
(element LI
  (make element gi: "listitem"
        (make element gi: "para"
              (process-children))))

(element H1
  (make element gi: "sect2"
        (make element gi: "title"
              (process-children))))

(element H2
  (make element gi: "sect3"
        (make element gi: "title"
              (process-children))))

(element H3
  (make element gi: "sect4"
        (make element gi: "title"
              (process-children))))
</style-specification-body>
</style-specification>
</style-sheet>
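The usage note in the stylesheet header says you may need to remove
the DTD lines from the start of the HTML before running jade. A small
Python sketch of that pre-processing step follows; it assumes a
single leading DOCTYPE declaration with no internal subset, and the
name strip_doctype is made up for the example.

```python
import re

# Matches one DOCTYPE declaration at the very start of the text,
# together with any whitespace around it. It assumes the declaration
# contains no ">" before its end (i.e. no internal subset).
DOCTYPE_RE = re.compile(r"\A\s*<!DOCTYPE[^>]*>\s*", re.IGNORECASE)


def strip_doctype(html):
    """Return html without its leading DOCTYPE declaration, if any."""
    return DOCTYPE_RE.sub("", html, count=1)
```

Text that does not begin with a DOCTYPE declaration is returned
unchanged.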
--
Gary Lawrence Murphy <[email protected]>: office voice/fax: 01 519 4222723
TCI - Business Innovations through Open Source : http://www.teledyn.com
Canadian Co-ordinators for Bynari International : http://ca.bynari.net/
Free Internet for a Free O/S? - http://www.teledyn.com/products/FreeWWW/