Script for pulling values from an rss file

    I have an rss file containing the following;
    Code:
    <?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
     <title>Some Tide Times</title>
     <link>http://www.tidetimes.org.uk/Some-tide-times</link>
     <description>Some tide times.</description>
     <lastBuildDate>Wed, 27 Mar 2013 00:00:00 GMT</lastBuildDate>
     <language>en-gb</language>
     <atom:link href="http://www.tidetimes.org.uk/Some-tide-times.rss" rel="self" type="application/rss+xml"/>
     <item>
      <title>Some Tide Times for 27th March 2013</title>
      <link>http://www.tidetimes.org.uk/Some-tide-times</link>
      <guid>http://www.tidetimes.org.uk/Some-tide-times</guid>
      <pubDate>Wed, 27 Mar 2013 00:00:00 GMT</pubDate>
      <description>&lt;a href="http://www.tidetimes.org.uk" title="Tide Times"&gt;Tide Times&lt;/a&gt; &amp; Heights for&lt;br/&gt;&lt;a href="http://www.tidetimes.org.uk/Some-tide-times" title="Some tide times"&gt;Some&lt;/a&gt; on 27th March 2013&lt;br/&gt;&lt;br/&gt;00:34 - Low Tide &#x28;1.40m&#x29;&lt;br/&gt;06:43 - High Tide &#x28;11.60m&#x29;&lt;br/&gt;12:58 - Low Tide &#x28;1.40m&#x29;&lt;br/&gt;19:05 - High Tide &#x28;11.70m&#x29;&lt;br/&gt;</description>
     </item>
    </channel>
    </rss>
    I have no idea how sed and awk work, but I guess they are the tools for the job? What I want to do is pull the high and low tide times and heights from inside the description tags using a shell script.

    Any ideas?

    TIA

    Pondy

    #2
    You probably want to use XSL.

    Sed and awk aren't really suited to xml.
    While you're waiting, read the free novel we sent you. It's a Spanish story about a guy named 'Manual.'


      #3
      Should be relatively straightforward to do with a regular expression (regex), as the data is in a consistent format, i.e. always low tide, high tide, low tide, high tide, with the values always in the same format: nn:nn for the time and n.nnm for the height. You might have to frig it a bit to account for tide heights of 10m or more (the sample already includes 11.60m), which need an extra digit.

      I'm rusty on this stuff as I haven't written shell scripts in years, but you should be able to use an appropriately crafted regex in sed to ditch everything up to the first tide time, then strip out the extraneous rubbish between the bits you want to keep, and dump the whole lot into a file.

      Sed itself is easy; the basic pattern is:

      sed -e 's/oldstuff/newstuff/g' inputFileName > outputFileName

      Which is basically saying: run sed with the script given after -e (the -e flag just introduces the expression), search each line of the input file for text matching "oldstuff", replace every match with "newstuff" (that's the trailing g), and redirect the result into a new file when you're done.

      Replace oldstuff and newstuff with the regex to identify the data you want and the desired output and you should be away.
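
      As a minimal sketch of that pattern applied to this feed, the function below swaps the two bracket entities for real brackets. The name decode_brackets is made up for illustration, and it reads from stdin rather than a named file.

```shell
#!/bin/sh
# decode_brackets is a hypothetical helper name, not part of sed.
# It reads text on stdin and writes the result to stdout.
decode_brackets() {
    # Each -e adds one substitution; the trailing g replaces every
    # match on a line, not just the first one.
    sed -e 's/&#x28;/(/g' -e 's/&#x29;/)/g'
}

# Usage: pipe one line from the feed through the helper.
printf '%s\n' '00:34 - Low Tide &#x28;1.40m&#x29;' | decode_brackets
```

      To run it against a file instead, redirect the file in: decode_brackets < inputFileName > outputFileName.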

      Of course, figuring out the regex is going to be the fun part, especially as you are going to have to use back references to hold the bits you want and put them into the output.
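
      To make the back-reference idea concrete, here is a rough sketch rather than a finished script. It assumes the entities have already been turned into real brackets and that each tide entry sits on its own line; split_tide is a hypothetical helper name. In sed's basic regex syntax \( and \) capture a group, while a bare ( matches a literal bracket.

```shell
#!/bin/sh
# split_tide is a hypothetical helper name used for illustration.
# Input lines look like: 06:43 - High Tide (11.60m)
split_tide() {
    # \1 = time, \2 = Low or High, \3 = height without the trailing m.
    # -n plus the p flag prints only lines where the pattern matched.
    sed -n 's/^\([0-9][0-9]:[0-9][0-9]\) - \([A-Za-z]*\) Tide (\([0-9][0-9.]*\)m)$/\2 \1 \3/p'
}

# Usage: reorder one entry into "type time height".
printf '%s\n' '06:43 - High Tide (11.60m)' | split_tide
```

      For the sample line this prints High 06:43 11.60. The height pattern [0-9][0-9.]* accepts any number of digits, so tides of 10m and over need no special case.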

      If you are running in a command-line setting, it may be easier to do the whole thing in Perl if you have it available, although you will still need to get your head around the regex.
      "Being nice costs nothing and sometimes gets you extra bacon" - Pondlife.


        #4
        This

        Code:
        cat $1 | grep "description.*Low" | sed 's/&#x28;/(/g' | sed 's/&#x29;/)/g' | sed 's/^.*\([0-9][0-9]:[0-9][0-9] - Low Tide ([.0-9][.0-9]*m)\).*\([0-9][0-9]:[0-9][0-9] - High Tide ([.0-9][.0-9]*m)\).*\([0-9][0-9]:[0-9][0-9] - Low Tide ([.0-9][.0-9]*m)\).*\([0-9][0-9]:[0-9][0-9] - High Tide ([.0-9][.0-9]*m)\).*$/\1+\2+\3+\4/' | tr '+' '\n'
        Outputs this:

        00:34 - Low Tide (1.40m)
        06:43 - High Tide (11.60m)
        12:58 - Low Tide (1.40m)
        19:05 - High Tide (11.70m)

        P.S. Yes, it's "old fashioned" regex
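
        If the number of tides per day is not always four, a looser sketch is to print each match on its own line instead of capturing four groups. This assumes a grep that supports -o and -E (GNU, BSD and busybox grep do, but -o is not in POSIX) and that no tide entry is split across lines in the real file; list_tides is a hypothetical name.

```shell
#!/bin/sh
# list_tides is a hypothetical helper name; it reads the RSS on stdin.
list_tides() {
    # Step 1: turn the bracket entities into real brackets.
    # Step 2: -o prints only the matching text, one match per line,
    #         so any number of tide entries is handled.
    sed -e 's/&#x28;/(/g' -e 's/&#x29;/)/g' |
        grep -oE '[0-9]{2}:[0-9]{2} - (Low|High) Tide \([0-9.]+m\)'
}

# Usage (tide.rss is the file name used elsewhere in this thread):
#   list_tides < tide.rss
```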


          #5
          If the feed is always well-formed XML, I'd parse it as such. Much as I love XSLT I think it might be overkill for this case: just parsing it into a DOM and yanking the data out of that directly would be sufficient.
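
          For a shell-only flavour of this, one option is xmllint from libxml2, assuming it is installed (it is not part of the base system everywhere). The sketch below asks the XML parser for the description text, which decodes &lt; and &#x28; for free, and only then pattern-matches. tides_via_xml is a hypothetical name, the XPath assumes the feed layout shown in the first post, and string() returns only the first item's description.

```shell
#!/bin/sh
# tides_via_xml is a hypothetical helper name; $1 is the path to the RSS file.
# Assumes xmllint (libxml2) and a grep with -o and -E are available.
tides_via_xml() {
    # string(...) yields the text of the first matching node with all
    # XML entities and character references already decoded.
    xmllint --xpath 'string(/rss/channel/item/description)' "$1" |
        grep -oE '[0-9]{2}:[0-9]{2} - (Low|High) Tide \([0-9.]+m\)'
}

# Usage (tide.rss is the file name used elsewhere in this thread):
#   tides_via_xml tide.rss
```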

          If the feed isn't always well-formed, which is often the case due to people thinking XML is "just text" (which it isn't) and can be generated by string concatenation (which it can, but not by most people), then a feedparser implementation would be a better choice: Universal Feed Parser should do the trick. On reflection, it's probably best just to use that, or one of its equivalents for your language of choice.


            #6
            Perhaps a slightly controversial suggestion, but why not work the numbers out yourself?


              #7
              sed is not cool for parsing xml, but here ya go :
              Code:
              ~$ sed 's/&[^;]*;/\n/g;/^..:.. - .* Tide \n/{s/\n//;P};D' tide.rss
              00:34 - Low Tide 1.40m
              06:43 - High Tide 11.60m
              12:58 - Low Tide 1.40m
              19:05 - High Tide 11.70m
              ~$


                #8
                Originally posted by Platypus View Post
                Code:
                cat $1 | grep "description.*Low" | sed 's/&#x28;/(/g' | sed 's/&#x29;/)/g' | sed 's/^.*\([0-9][0-9]:[0-9][0-9] - Low Tide ([.0-9][.0-9]*m)\).*\([0-9][0-9]:[0-9][0-9] - High Tide ([.0-9][.0-9]*m)\).*\([0-9][0-9]:[0-9][0-9] - Low Tide ([.0-9][.0-9]*m)\).*\([0-9][0-9]:[0-9][0-9] - High Tide ([.0-9][.0-9]*m)\).*$/\1+\2+\3+\4/' | tr '+' '\n'
                Originally posted by Contreras View Post
                Code:
                ~$ sed 's/&[^;]*;/\n/g;/^..:.. - .* Tide \n/{s/\n//;P};D' tide.rss
                Smart Arse


                  #9
                  Cheers guys.

                  I also like Doodab's idea but can't find a free version of the tables or a formula.

                  Will have a looky at NF's feed parser, but more as an exercise in technical masturbation since Platy and Contreras have solved it for me.

                  Rep will be forthcoming to all (Platy will have to wait a bit)
