
                    THE THEORY BEHIND REBATCH AND WHERE.TO.
                                       
Contents

   Assumptions
     * Newsfeeds file
     * One innxmit fits all
       
   What rebatch Does
     * Newsfeeds parsing
     * rebatch's gory details
     * Files examined by rebatch
     * sub do_site_flush
     * sub rebatch_files
     * sub send_batch
       
   What where.to Does
     * Newsfeeds parsing
     * Two problems with where.to
       
   innxmit vs NNTPlink
     _________________________________________________________________
   
Assumptions

   rebatch reads your newsfeeds file to figure out which sites are fed by
   NNTPlink. It expects to see entries of the form:
   
     site:*:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp.host.org
     
   It makes the assumption that all NNTPlink feed sites are up for grabs.
   This decision is made by looking at the contents of the last field on
   the newsfeed entry. If it matches the expression in $nntplink_id then it
   is a target.
   
   If this is not true at your site then, until I think of something
   better, you can try this:
   
   Make a dummy newsfeeds file with lines like
   
     site:xxx:xxx:nntplink nntp.host
     
   and point rebatch at that.
   
   A separate configuration file is NOT an option - I want to think less
   not more each time I add a feed!
   
   If that does not appeal, you could create a symbolic link giving
   nntplink another name (e.g., nntplink2), use nntplink2 in your
   newsfeeds file and as the value of $nntplink_id in rebatch.conf.
   
   I also assume that the last thing on the newsfeeds line is the remote
   host to send it to. If that is not true, then a dummy newsfeeds file
   is called for.
   
   rebatch can cope with INN's comments, whitespace and continued lines
   in the same way as INN does.
   
   One other fun assumption is that one innxmit fits all. That is, it is
   suitable to send a batch file with...
   
     innxmit -t 300 remotehost batchfile
     
   ("-t 300" ensures you don't get hung processes if the remote site is
   off the air). You can alter the innxmit parameters on a global basis,
   but not on a case by case one.
   
   If a real need for a per-site configuration file arises, I will
   think about adding it. This was written for my site and (fortunately)
   I don't need that complexity.
     _________________________________________________________________
   
What Rebatch Does

   This section details the operation of the rebatch script.
   
  NEWSFEEDS PARSING
  
   rebatch (and where.to) start by calling the subroutine read_newsfeeds
   in rebatch.common to parse the newsfeeds file.
   
   It reads a line at a time from the file, discarding everything after a
   comment character, then leading and trailing whitespace.
   
   If the line ends in a '\' character, the '\' is removed and the next
   line is read and appended to it. When it finds a line that does not
   end in a continuation character (and is not blank) it assumes it has
   a complete newsfeeds entry.
   
   That is, the entries
   
     site1:*:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp.host.org
     
     ## site2 is very fussy about the groups it gets
     site2:!*,\
     hundreds of individual group lines,\
     each ending in,\
     a continuation character,\
     !/local:Tc,Wnm:/usr/local/news/bin/nntplink -k -q nntp2.host.org
     
   are both parsed the same way INN parses them.
   
   The line is then split apart on the ':' character.
   
   If there is a '/' in the first (site) portion, it and everything after
   it is disposed of.
   
   The fourth part is examined. If it contains the $nntplink_id (which
   can be a Perl regular expression) then we have an NNTPlink feed site
   rebatch can cope with. If we don't see $nntplink_id, then this entry
   is discarded. rebatch then splits the fourth part apart on spaces and
   takes the last portion (nntp.host.org and nntp2.host.org in the above
   examples) as the host name we send articles to.
   
   The sitename and nntphost name are recorded in arrays for later use.
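
   The parsing steps above can be sketched roughly as follows. This is
   an illustration in Python, not the Perl code in rebatch.common: the
   function only borrows the name read_newsfeeds, and the default
   pattern "nntplink" stands in for whatever $nntplink_id is set to at
   your site.

```python
import re


def read_newsfeeds(text, nntplink_id=r"nntplink"):
    """Return (sitename, nntphost) pairs for NNTPlink feeds (sketch only)."""
    pairs = []
    pending = ""
    for raw in text.splitlines():
        # Discard everything after a comment character, then trim whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.endswith("\\"):
            # Continued entry: drop the '\' and keep reading.
            pending += line[:-1]
            continue
        entry, pending = pending + line, ""
        fields = entry.split(":")
        if len(fields) < 4:
            continue
        # Anything from the first '/' onwards in the site name is dropped.
        sitename = fields[0].split("/", 1)[0]
        # Rejoin in case the fourth part itself contained a ':'.
        param = ":".join(fields[3:])
        if not re.search(nntplink_id, param):
            continue  # not an NNTPlink feed
        words = param.split()
        pairs.append((sitename, words[-1]))
    return pairs
```

   Feeding it the two example entries above should yield the pairs
   (site1, nntp.host.org) and (site2, nntp2.host.org).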
   
  REBATCH'S GORY DETAILS
  
   rebatch starts by making a lockfile and using shlock style locking to
   ensure that only one of itself is running at once. Nothing magic here.
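
   For readers who have not met shlock style locking, here is a
   simplified sketch in Python. It is not rebatch's code: the function
   names are made up for the example, and shlock itself is more careful
   about the window between creating the lock file and writing the
   process ID into it.

```python
import os


def _holder_alive(path):
    """True if the lock file names a process that still exists."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return False  # unreadable or garbled lock: treat it as stale
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to someone else
    return True


def acquire_lock(path):
    """Try to take the lock at path; return True on success."""
    for _ in range(2):  # the second pass happens only after a stale lock
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            if _holder_alive(path):
                return False
            try:
                os.unlink(path)  # owner has gone away: clear and retry
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(f"{os.getpid()}\n")
        return True
    return False
```

   The point of recording the process ID is that a lock left behind by
   a crashed run does not block every later run.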
   
   
   Then it reads the newsfeeds file (see above). It works through the
   sitename/nntphost pairs after that. This is where things get
   "interesting".
   
   I have found six types of files that a channel-feed NNTPlink can
   leave behind:
    1. nntphost.link
    2. nntphost.1234
    3. nntphost.1234.tmp
    4. sitename
    5. sitename.1234
    6. nntphost.rebatch
       
   They are:
    1. nntphost.link is a status file NNTPlink uses. It contains things
       like the process ID of the current NNTPlink process. This file is
       normal and rebatch ignores it.
    2. nntphost.1234 is created when the remote host gets too far behind
       in its acceptance of articles, or NNTPlink exited with articles to
       send. Files of type #3 get renamed into files with this name.
    3. nntphost.1234.tmp is created when NNTPlink can not contact the
       remote host, or the remote host refuses articles for some reason.
       Files of this form are open for writing by NNTPlink, so care needs
       to be taken with them. If NNTPlink has one of these files open
       when it exits, it renames it to be of the #2 form.
    4. sitename is created when NNTPlink is so far behind that INN
       notices and starts writing its own batch files. If this file
       exists, INN may have it open for writing, so the only way to cope
       with it is to rename the file, then send INN a flush for that
       site.
    5. sitename.1234 is not created by NNTPlink or INN. One of my beta
       sites asked for files of this form to be cleaned up (something to
       do with NNTPlink funnel feeds?) so I do.
    6. nntphost.rebatch files are created by the rebatch program and
       contain the concatenation of the other batch files (the status
       file, #1, is never included).
       
   Types #5 and #6 are not created by NNTPlink or INN, but are
   considered anyway.
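
   The six types can be told apart by name alone. The sketch below (in
   Python, purely illustrative; classify is a hypothetical helper, not
   a routine in rebatch) assumes that "1234" always stands for a run of
   digits and that the site name and host name differ.

```python
import re


def classify(filename, sitename, nntphost):
    """Return the type number (1-6) from the list above, or None."""
    host = re.escape(nntphost)
    site = re.escape(sitename)
    patterns = [
        (1, rf"{host}\.link"),       # NNTPlink status file
        (2, rf"{host}\.\d+"),        # closed NNTPlink batch file
        (3, rf"{host}\.\d+\.tmp"),   # batch file NNTPlink still has open
        (4, site),                   # INN's own batch file
        (5, rf"{site}\.\d+"),        # left-over site batch file
        (6, rf"{host}\.rebatch"),    # rebatch's collected batch file
    ]
    for kind, pattern in patterns:
        if re.fullmatch(pattern, filename):
            return kind
    return None
```

   Because every pattern must match the whole name, a file such as
   site10 is not mistaken for site1's batch file.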
   
   
   For each site/hostname pair, rebatch calls the subroutine
   do_site_flush to scan the out.going directory for filenames of the
   form #1-#6 above. It does this with a shell glob matching the patterns
   
     $nntphost.*
     $sitename*
     
   Files of type #1 are ignored.
   
   If it finds files of type #3 or #4 it makes a mental note to flush the
   site.
   
   If it sees a file like #4 it renames it to $nntphost.0, to be picked
   up by a later glob. Hopefully 0 will never be a valid process ID, so
   this file will not be clobbered by NNTPlink. To make sure, if it sees
   it would overwrite a file, it appends 0 onto it until it gets a unique
   name.
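
   The renaming rule is small enough to show in full. This is a Python
   sketch with a made-up function name, and it assumes nothing else
   creates files in the directory between the check and the rename.

```python
import os


def rename_inn_batch(outgoing, sitename, nntphost):
    """Move INN's batch file to nntphost.0, adding 0s until the name is free."""
    target = os.path.join(outgoing, nntphost + ".0")
    while os.path.exists(target):
        target += "0"  # nntphost.00, nntphost.000, ...
    os.rename(os.path.join(outgoing, sitename), target)
    return target
```

   The resulting name still looks like a type #2 file, which is what
   lets the later glob pick it up.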
   
   Files like #2, #5 or #6 are ignored at this point, but a note is made
   that a significant file was found ... that is, a batch file that may
   need transmitting.
   
   Once the scanning is complete, it sees if it needs to flush the site
   and issues a ctlinnd command if it does.
   
   If a batch file that needs transmitting is found, it then calls the
   subroutine rebatch_files to collect the batch files together.
   
   sub rebatch_files takes the output of the shell glob matching these
   two patterns:
   
     $nntphost.*
     $sitename.*
     
   and works through it. (NOTE the period in the second glob.)
   
   If the file is not of the form
   
     nntphost.1234
     
   or
   
     sitename.1234
     
   (that is types #2 and #5) it is ignored. Files of that form are
   concatenated to the .rebatch file, then unlinked.
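
   A rough Python sketch of that collection step follows. It is not
   the Perl in rebatch: the function signature is an assumption, and
   the files are taken in sorted order only because some order has to
   be chosen.

```python
import glob
import os
import re
import shutil


def rebatch_files(outgoing, sitename, nntphost):
    """Append nntphost.<digits> and sitename.<digits> to nntphost.rebatch."""
    rebatch = os.path.join(outgoing, nntphost + ".rebatch")
    wanted = re.compile(
        rf"(?:{re.escape(nntphost)}|{re.escape(sitename)})\.\d+")
    found = set()
    for prefix in (nntphost, sitename):
        # Note the period before the '*' in both patterns.
        pattern = os.path.join(glob.escape(outgoing),
                               glob.escape(prefix) + ".*")
        found.update(glob.glob(pattern))
    for path in sorted(found):
        if not wanted.fullmatch(os.path.basename(path)):
            continue  # .link, .tmp, .rebatch and anything else are skipped
        with open(path, "rb") as src, open(rebatch, "ab") as dst:
            shutil.copyfileobj(src, dst)
        os.unlink(path)
    return rebatch
```

   Opening the .rebatch file in append mode matters, since an earlier
   run may have left one that has not been sent yet.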
   
   Then the batchfile is transmitted to the remote host in the subroutine
   send_batch.
   
   sub send_batch does shlock style locking to ensure only one innxmit
   process is going to be running at once to a site.
   
   If no other rebatch process is active for that site, it then double
   forks...
   
   The original parent moves to the next site.
   
   The first child creates the lock file, forks and then waits for the
   grandchild to die.
   
   The grandchild reopens STDOUT and STDERR to the progress file, writes
   the current time in seconds since 1970 to the file and then execs
   innxmit.
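
   The double fork can be sketched in Python as below. This is an
   illustration under assumptions, not rebatch's code: the lock file is
   written directly rather than with shlock style locking, the progress
   file is truncated on each run, the lock is removed once the
   grandchild exits, and argv is whatever command line you would use
   for innxmit.

```python
import os
import time


def send_batch(lockfile, progress, argv):
    """Start argv detached (sketch of the double fork) and return at once."""
    if os.fork() != 0:
        return  # original parent: move on to the next site

    # First child: hold the lock for as long as the grandchild runs.
    status = 1
    try:
        with open(lockfile, "w") as fh:
            fh.write(f"{os.getpid()}\n")
        pid = os.fork()
        if pid == 0:
            # Grandchild: point STDOUT and STDERR at the progress file,
            # record the start time, then become the transmit program.
            try:
                fd = os.open(progress,
                             os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
                os.dup2(fd, 1)
                os.dup2(fd, 2)
                os.write(1, f"{int(time.time())}\n".encode())
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if the exec failed
        os.waitpid(pid, 0)
        status = 0
    finally:
        try:
            os.unlink(lockfile)
        except OSError:
            pass
        os._exit(status)  # never fall back into the caller's code
```

   The first line of the progress file is the start time, which is
   what lets a separate program work out how long the transfer has
   been running.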
   
   And that is about it!
     _________________________________________________________________
   
where.to gory details

   where.to is a considerably simpler program.
   
   It also begins by reading the newsfeeds file as detailed above, then
   works through each site/host pair.
   
   If it sees a lock file for the site it notes the time and counts the
   number of ihave lines in the progress file. It then does a similar
   glob to rebatch to find all the batch files for the site and counts
   the number of lines in them.
   
   It then does a little math to figure out how many articles to send and
   the estimated time to send them, based on how many have gone before
   and how long it has taken.
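
   The "little math" amounts to a rate calculation. A sketch in Python,
   assuming (this is my reading, not a statement about the code) that
   the articles left are the batch file lines minus the ihave lines
   seen so far, and that the rate so far is taken as the rate to come:

```python
def estimate(start_time, now, sent, queued):
    """Return (articles_left, seconds_left); seconds_left may be None.

    start_time and now are seconds since 1970, sent is the number of
    ihave lines in the progress file, and queued is the number of
    lines across the batch files for the site.
    """
    remaining = max(queued - sent, 0)
    elapsed = now - start_time
    if sent == 0 or elapsed <= 0:
        return remaining, None  # no rate to extrapolate from yet
    rate = sent / elapsed  # articles per second so far
    return remaining, remaining / rate
```

   For example, 300 articles sent in 600 seconds with 900 queued gives
   600 articles left and an estimate of 1200 seconds.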
   
   There are two flaws in where.to that I know of:
    1. where.to will tend to overestimate the time to completion...
       something about averages and rates and all that.
    2. where.to will overestimate the number of articles left to send in
       one case. When you have a batchfile with expired articles in it,
       innxmit will silently skip over the expired articles. The
       articles won't produce ihave lines and won't get counted as
       'transmitted' articles.
       
   
   
   I have no plans to fix these. Earlier versions of where.to did all
   sorts of clever things (remember the last message ID, look for its
   position in the batch file), but they don't work well with the
   globbing.
   
   If the site is not running, then it simply counts up the lines in the
   candidate batch file and tells you. If it sees a .tmp file it tells
   you as well.
     _________________________________________________________________
   
innxmit vs NNTPlink

   I have been asked about the virtues of using innxmit for sending
   batches of articles. After all, we abandoned nntpsend (which uses
   innxmit) in favour of NNTPlink for a performance boost, and by gayds
   we got one! Is it a step back to use anything but NNTPlink for this
   purpose?
   
   Answer: NO!
   
   The purpose of running NNTPlink as a channel feed is to pass on the
   article as soon as it arrives on the server. [Tom Limoncelli calls
   this the "INN Instant Party" and "a gimmick" in his INN FAQ.]
   
   If you do this, the text of the article will be in the system's buffer
   cache so won't have to be fetched off disk, so will go much quicker
   and so you get a performance boost.
   
   From the INN FAQ:
   
     Ian Phillipps <ian@unipalm.co.uk>:
     
     (2) More important, if you have a large number of feeds, NNTPlink
     permits them to be fed simultaneously with the same articles. No big
     deal, until you think of the what's going on in the pagedaemon and
     the disk cache.
     
     A "ps uaxr" rarely catches NNTPlink in the act ("D"), despite my
     having 17 of them last time I counted. Our biggest outgoing newsfeed
     delivered 16398 articles yesterday, using a total of 380 seconds CPU
     on a Sun IPC, and no disk time :-)
     
   Compare that to running the same set of sites via innxmit, where you
   might be sending the same article to n sites but not all at the same
   time. You are going to have to retrieve the same article n times from
   disk. [Unless you have more buffer space than sense, of course.]
   
   When it comes to transmitting a backlog, all of the gains of the
   buffer cache go out the window, and we are back to having to retrieve
   articles off disk again.
   
   To transmit the backlog, the algorithm goes something like...
     * For each article in the batch file:
     * Send 'ihave <message-id>' to the remote server
     * If we get a positive response, cat the article at the remote server
     * Send a '.'
     * Wait for the reply
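
   One pass of that loop can be sketched as follows, in Python. This
   is neither innxmit's nor NNTPlink's code. The connection object and
   its send_line/read_line methods are invented for the example; the
   reply codes 335 (send the article) and 235 (transferred OK) are the
   ones NNTP defines for IHAVE.

```python
class ScriptedConnection:
    """Stand-in for a real NNTP socket: replays canned server replies."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.sent = []

    def send_line(self, line):
        self.sent.append(line)

    def read_line(self):
        return self.replies.pop(0)


def offer_article(conn, message_id, article_lines):
    """Offer one article; return True if the remote server took it."""
    conn.send_line(f"IHAVE {message_id}")
    if not conn.read_line().startswith("335"):
        return False  # for example 435: the remote already has it
    for line in article_lines:
        # A leading '.' is doubled so it is not taken for the terminator.
        conn.send_line("." + line if line.startswith(".") else line)
    conn.send_line(".")
    return conn.read_line().startswith("235")
```

   Each article costs two round trips when it is wanted and one when
   it is not, whichever program is doing the sending.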
       
   Excepting any major brain deadness in innxmit or NNTPlink I would
   expect the performance to be identical. I have not tested this theory.
   It is very difficult to test this. You need two batches of articles of
   approximately the same composition, but with different message IDs.
   Life is too short.
   
   So far it is a toss-up between innxmit and NNTPlink.
   
   Then why did I choose innxmit?
   
   rebatch is modelled after INN's nntpsend program. They share one
   feature: they both append to a batchfile that innxmit is working from.
   I examined the code of innxmit and saw it took great pains to cope if
   people did that. I have not examined the code of NNTPlink to see if it
   does the same. NNTPlink does so much *more* than innxmit that it was
   difficult to see what it was doing. That is no slur on NNTPlink, I
   just think innxmit is a better tool for this job.
   
   If anyone has hard evidence one way or the other about this, do let me
   know. If I am talking through my hat, do have the grace to tell me
   politely.
     _________________________________________________________________
   
    Part of the rebatch package
    Russell Street (r.street@auckland.ac.nz)
    Last updated: 12th February 1995
