
Re: Converting HTML to DocBook



>>>>> "j" == juro  <[email protected]> writes:

    j> Hello, I'd like to ask you if there's an easy way (a convertor)
    j> to convert a html file to sgml.  If yes, where can I get it.

If you really mean SGML, the answer is trivial: HTML _is_ an SGML
application, so the identity transformation will do the trick ;) I
suspect, however, that you meant to ask whether we can translate HTML
into even a crude form of DocBook DTD SGML.

I hope to be corrected, but the answer is apparently 'no'.  Undaunted
by the lack of an answer on many mailing lists, I attempted to create
one as a one-evening project; my experience was enough to demonstrate
that the situation is non-trivial in the extreme.

For example, even if the HTML is _very_ well-behaved, we have to
regularly interpret constructs such as

    <H1>the first section</H1>
    <P>this is the first section</p>
    <H2>and a subsection</h2>
    <p>with some text</p>
    <H1>the next section</h1>

and turn it into 

    <sect1><title>the first section</title>
    <para>this is the first section</para>
    <sect2><title>and a subsection</title>
    <para>with some text</para>
    </sect2>
    </sect1>
    <sect1> ...
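
The bookkeeping this requires is easier to see outside DSSSL.  Here is
a minimal sketch in Python (not part of the stylesheet below; the
function name nest_sections and the (kind, text) event format are made
up for illustration) that keeps a stack of open section levels and
closes them whenever a heading of the same or a shallower level turns
up.  It assumes headings never skip a level; an H3 directly under an
H1 would come out as a sect3 inside a sect1, which DocBook does not
allow.

```python
def nest_sections(events):
    """Turn flat (kind, text) events into nested DocBook-like markup.

    kind is "h1".."h6" for a heading or "p" for a paragraph.
    Returns a list of output lines.
    """
    out = []
    open_levels = []  # stack of section depths that are currently open
    for kind, text in events:
        if kind == "p":
            out.append(f"<para>{text}</para>")
            continue
        level = int(kind[1])
        # Close every open section at this depth or deeper.
        while open_levels and open_levels[-1] >= level:
            out.append(f"</sect{open_levels.pop()}>")
        open_levels.append(level)
        out.append(f"<sect{level}><title>{text}</title>")
    # Close whatever is still open at the end of the input.
    while open_levels:
        out.append(f"</sect{open_levels.pop()}>")
    return out


if __name__ == "__main__":
    example = [
        ("h1", "the first section"),
        ("p", "this is the first section"),
        ("h2", "and a subsection"),
        ("p", "with some text"),
        ("h1", "the next section"),
    ]
    print("\n".join(nest_sections(example)))
```

The point of the sketch is that the closing tags depend on what comes
later in the document, so each heading cannot be translated on its own.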

There is no equivalent to the sectN tag in HTML, and the fundamental
differences only begin with this first, most elemental element.
Consider the logic of parsing the <A> tag, which must be different for
NAME and HREF types, or the nightmare of tables within tables.  Even
the <HEAD> does not really map to <ARTHEADER>, and HTML is full of
non-containers which contain the contained information inside
attributes rather than within container tags (whew).
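
To make the <A> case concrete, here is a rough Python sketch of the two
branches.  The function name convert_anchor and the dictionary of
lower-cased attribute names are assumptions for illustration only; the
mapping to DocBook's ulink (with its url attribute) and the empty
anchor element (with its id attribute) is one plausible choice, not
the only one.  It does not check that a NAME value is a legal SGML ID.

```python
from xml.sax.saxutils import quoteattr


def convert_anchor(attrs, content):
    """Map one HTML <A> element to a rough DocBook equivalent.

    attrs   -- dict of lower-cased attribute names to values
    content -- the already-converted text inside the element
    """
    parts = []
    name = attrs.get("name")
    href = attrs.get("href")
    if name is not None:
        # A named target: DocBook's anchor is an empty element in SGML,
        # so it gets no end tag and the text simply follows it.
        parts.append(f"<anchor id={quoteattr(name)}>")
    if href is not None:
        # A link: the target moves into ulink's url attribute.
        parts.append(f"<ulink url={quoteattr(href)}>{content}</ulink>")
    else:
        parts.append(content)
    return "".join(parts)


if __name__ == "__main__":
    print(convert_anchor({"href": "http://www.teledyn.com"}, "TCI"))
    print(convert_anchor({"name": "intro"}, "Introduction"))
```

Even this small case needs a decision per attribute, which is why a
one-to-one tag renaming table cannot cover it.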

Ok, it helped me a little, so I am including my stylesheet with this
message --- be forewarned it is really sloppy because I was just toying
around with it; it is a composite of several postings on the DSSSL
list spliced together with no concern for consistent case or aesthetic
style.  I estimate it saved me maybe 20% of the total time to
translate the kerneld HOWTO, and given that small gain, I really
wonder if it was worth it.  Still, nothing is ever a complete failure:
it can always be used as a bad example.

<!--  
   HTML to Docbook transformation by Gary Lawrence Murphy 
   with all the ideas from many other people.  It doesn't work
   but it does save some of the grunt work in moving a well-behaved
   HTML file to a format that is DocBook-like.  Don't expect
   miracles.

   This stylesheet was simply spliced together from comments and
   musings by various authors on the dssslist at mulberrytech.com.
   I cannot rightly claim copyright and only include the license
   below to clarify the free nature of this work.

   The A tag support is broken horribly, as is ULINK, but the UL
   support alone makes it somewhat useful.

 USE: jade -d html2db.dsl HTMLFILE > SGMLFILE

   you may need to remove any DTD lines from the start of your HTML

 LICENCE

 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
  
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
  
 You should have received a copy of the GNU General Public License
 along with this program; if not, write to the Free Software
 Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

-->

<!DOCTYPE style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">
<style-sheet>

<style-specification id="html2db">
<style-specification-body> 

(declare-flow-object-class element
       "UNREGISTERED::James Clark//Flow Object Class::element")

(define (copy-attributes #!optional (nd (current-node)))
  (let loop ((atts (named-node-list-names (attributes nd))))
    (if (null? atts)
        '()
        (let* ((name (car atts))
               (value (attribute-string name nd)))
          (if value
              (cons (list name value)
                    (loop (cdr atts)))
              (loop (cdr atts)))))))

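;; HR has no DocBook counterpart, so it is dropped.  It needs its own
;; rule because the case below must always yield a GI string.
(element HR (empty-sosofo))

;; Every other element is renamed through the table below; anything
;; not listed keeps its HTML name.  With no content expression given,
;; make processes the children of the current node.
(default 
  (let* ((old-gi (gi (current-node)))
         (new-gi
          (case old-gi

            (("HTML") "article")
            (("HEAD") "artheader")
            (("BODY") "sect1")
            (("PRE") "screen")
            (("UL") "itemizedlist")
            (("I")  "emphasis")
            (("STRONG")  "emphasis")
            (("B")  "emphasis")
            (("TT") "command")
            (("P") "para")
            (("MENU") "itemizedlist")
            (else old-gi))))
    (make element
      gi: new-gi
      attributes: (copy-attributes))))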

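;; An anchor needs different handling for HREF (a link) and NAME (a
;; target).  This is only a rough approximation of both: NAME values
;; are copied as-is and may not be legal IDs.
(element A
  (cond
   ((attribute-string "HREF")
    (make element gi: "ulink"
          attributes: (list (list "url" (attribute-string "HREF")))
          (process-children)))
   ((attribute-string "NAME")
    (make element gi: "phrase"
          attributes: (list (list "id" (attribute-string "NAME")))
          (process-children)))
   (else (process-children))))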

(element LI
  (make element gi: "listitem"
        (make element gi: "para"
              (process-children))))

(element H1
  (make element gi: "sect2"
        (make element gi: "title"
              (process-children))))

(element H2
  (make element gi: "sect3"
        (make element gi: "title"
              (process-children))))

(element H3
  (make element gi: "sect4"
        (make element gi: "title"
              (process-children))))

</style-specification-body>
</style-specification>
</style-sheet>


-- 
Gary Lawrence Murphy <[email protected]>: office voice/fax: 01 519 4222723
TCI - Business Innovations through Open Source : http://www.teledyn.com
Canadian Co-ordinators for Bynari International : http://ca.bynari.net/
Free Internet for a Free O/S? - http://www.teledyn.com/products/FreeWWW/