rvest: problems crawling a website that uses JavaScript

steve215 · May 9, 2019, 8:39pm

Hello, I want to scrape a website for demo purposes.

The HTML source of the page is:

<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" ng-app="TUMNewsApp"><head><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta charset="UTF-8"><meta name="robots" content="index, follow"><meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no"><meta name="google-site-verification" content="NhJYhvGtruZZ2iJmsYz1KuofthSHyl3icQQYT3wba6k" /><link rel="shortcut icon" 

[...]

target="_blank">tum.school.of.management</a> </span></div><div class="social_item"> <i class="icon-icn_tum_youtube"></i> <span> Subscribe to our channel!<br> <a href="//www.youtube.com/channel/UCXdFu0pi275lddSR1HjLg8A" target="_blank">TUM School of Management</a> </span></div></div><div class="social rss"> <a target="_blank" href="https://www.wi.tum.de/feed/rss/"><div class="social_item"> <i class="icon-icn_tum_RSS"></i> <span> <b>NEWS RSS abonnieren</b> </span></div> </a></div></div></div></div> <script>(function ( $ ) {
        $( document ).ready( function () {
            $('.feature_8871').slick({
                infinite: false,
                slidesToShow: 1,
                slidesToScroll: 1,
                arrows: true,
                dots: true,
                autoplay: true,
                autoplaySpeed: 5000
            });

            var count = $('.feature_8871 .slick-dots li').length;
            $(".feature_8871 .slick-dots li").each(function () {
                $(this).find("button").remove();
            });

        });
    })( jQuery );</script> </div></div></div></div><div class="vc_row wpb_row row bg_default"><div class="wpb_column vc_column_container vc_col-sm-12 col-xs-12 col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper"><div class="dhsv_vc_anker point"><div id="News-Archive" data-ankername="News Archive" class="ankerpoint"></div></div><div class="wpb_text_column wpb_content_element " ><div class="wpb_wrapper"><h3>News Archive</h3></div></div> <script language='javascript'>var beitraege =  [ {
        ID:'59888',
        url:'https://www.wi.tum.de/wp-content/uploads/2019/02/Fotolia_208486536_S-300x169.jpg',
        category:' <span>International</span> ',
        tag:[45],
        permalink: 'https://www.wi.tum.de/tum-ranked-in-the-first-league-with-study-quality/',
        date:'8 May 2019',
        title:'TUM ranked in the first league with study quality',
        exerpt: 'CHE University Ranking: Students rate engineering programs Students give the Technical University of Munich (TUM) many high marks. This is seen in the latest rankings from the Centre for Higher … <br>Read More here <i class="icon-icn_tum_internlink"></i>'

    }, {
        ID:'59594',
        url:'https://www.wi.tum.de/wp-content/uploads/2018/04/20170323_bwl_Flyer_AH_311912-300x200.jpg',
        category:' <span>General</span>  <span>Student Life</span>  <span>Alumni</span> ',
        tag:[15, 41, 222],
        permalink: 'https://www.wi.tum.de/applications-open-for-the-social-impact-award-2019/',
        date:'4 May 2019',
        title:'Applications open for the Social Impact Award 2019',
        exerpt: 'TUM School of Management students and graduates can now submit their projects for the Social Impact Award 2019. If your project study, Bachelor’s or Master’s thesis tackles a social issue … <br>Read More here <i class="icon-icn_tum_internlink"></i>'

    }, {
        ID:'59614',
        url:'https://www.wi.tum.de/wp-content/uploads/2019/05/Fotolia_261607955_S_WP-sized-300x169.jpg',
        category:' <span>General</span>  <span>Studies</span> ',
        tag:[15, 91],
        permalink: 'https://www.wi.tum.de/subject-with-high-returns-why-business-studies-at-universities-in-germany-must-not-be-weakened-by-prof-dr-friedl-and-prof-dr-hutzschenreuter/',
        date:'3 May 2019',
        title:'Subject with high returns &#8211; Why business studies at universities in Germany must not be weakened by Prof. Dr. Friedl and Prof. Dr. Hutzschenreuter',
        exerpt: 'On April 18th 2019, the Frankfurter Allgemeine Zeitung published an article by Prof. Dr. Gunther Friedl and Prof. Dr. Thomas Hutzschenreuter why business studies at universities in Germany must be … <br>Read More here <i class="icon-icn_tum_internlink"></i>'

    }, {

[...]

I want to save the different events into a data frame.
For example:


[...]

ID:'59888',
        url:'https://www.wi.tum.de/wp-content/uploads/2019/02/Fotolia_208486536_S-300x169.jpg',
        category:' <span>International</span> ',
        tag:[45],
        permalink: 'https://www.wi.tum.de/tum-ranked-in-the-first-league-with-study-quality/',
        date:'8 May 2019',
        title:'TUM ranked in the first league with study quality',
        exerpt: 'CHE University Ranking: Students rate engineering programs Students give the Technical University of Munich (TUM) many high marks. This is seen in the latest rankings from the Centre for Higher … <br>Read More here <i class="icon-icn_tum_internlink"></i>'

[...]

But I can't scrape these values with CSS selectors or XPath.
I tried this:

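library(rvest)

# Download and parse the static HTML; read_html() does not execute JavaScript
tumNews <- read_html("https://www.wi.tum.de/about-2/news-events/")

# Note: .ng-binding is a class AngularJS adds at runtime in the browser
tumNews %>%
  html_nodes(".boxview , .excerpt , .ng-binding+ .ng-binding") %>%
  html_text()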

But I don't get the values I am looking for. Is there someone who can help me? Thanks in advance!
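
One possible approach, given the page source above: the items sit in an inline script as `var beitraege = [ {...}, ... ]`, so they may be readable from the static HTML without rendering the page. The sketch below pulls the single-quoted fields out of that script text with regular expressions in base R. The helper names `extract_field` and `parse_beitraege` are hypothetical, the sample text is shortened from the source above, and the pattern assumes that no value contains a plain single quote.

```r
# Hypothetical helper: return every value of  field:'value'  in the script text.
# Assumes values are single-quoted and contain no plain single quote.
extract_field <- function(js_text, field) {
  pattern <- paste0("\\b", field, ":\\s*'([^']*)'")
  hits <- regmatches(js_text, gregexpr(pattern, js_text, perl = TRUE))[[1]]
  sub(pattern, "\\1", hits, perl = TRUE)
}

# Hypothetical helper: one data frame row per item in the array.
parse_beitraege <- function(js_text,
                            fields = c("ID", "url", "category",
                                       "permalink", "date", "title")) {
  cols <- lapply(fields, function(f) extract_field(js_text, f))
  names(cols) <- fields
  # Every field must occur exactly once per item, otherwise rows would misalign.
  stopifnot(length(unique(lengths(cols))) == 1)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Shortened sample of the inline script from the page source.
# On the live page, the text could be obtained with rvest, for example:
#   scripts <- html_text(html_nodes(tumNews, "script"))
#   js_text <- scripts[grepl("beitraege", scripts)][1]
js_text <- "var beitraege =  [ {
        ID:'59888',
        url:'https://www.wi.tum.de/wp-content/uploads/2019/02/Fotolia_208486536_S-300x169.jpg',
        category:' <span>International</span> ',
        tag:[45],
        permalink: 'https://www.wi.tum.de/tum-ranked-in-the-first-league-with-study-quality/',
        date:'8 May 2019',
        title:'TUM ranked in the first league with study quality'
    }, {
        ID:'59594',
        url:'https://www.wi.tum.de/wp-content/uploads/2018/04/20170323_bwl_Flyer_AH_311912-300x200.jpg',
        category:' <span>General</span>  <span>Student Life</span>  <span>Alumni</span> ',
        tag:[15, 41, 222],
        permalink: 'https://www.wi.tum.de/applications-open-for-the-social-impact-award-2019/',
        date:'4 May 2019',
        title:'Applications open for the Social Impact Award 2019'
    } ];"

news <- parse_beitraege(js_text)
str(news)
```

The `tag` field is skipped here because it is an unquoted array, and `exerpt` is left out of the sample only to keep it short; it can be added to `fields` as long as the text contains no plain single quote.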

billyi · May 10, 2019, 5:32am

You might want to check this reply

steve215 · May 10, 2019, 9:42am

Thanks a lot for your help! I used PhantomJS.
I was able to get the headlines, but nothing more. Once I have the rendered code in RStudio, how can I select, for example, the ID in my case?

I used this code:

eventNodes <- html_nodes(events,".ng-binding")
html_text(eventNodes)

but I only get the headlines.

ID:'59888',
        url:'https://www.wi.tum.de/wp-content/uploads/2019/02/Fotolia_208486536_S-300x169.jpg',
        category:' <span>International</span> ',
        tag:[45],
        permalink: 'https://www.wi.tum.de/tum-ranked-in-the-first-league-with-study-quality/',
        date:'8 May 2019',
        title:'TUM ranked in the first league with study quality',

I want to save each of the items.
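
Before storing such items, the raw values would probably need some cleaning: `category` carries `<span>` markup and `date` is free text such as '8 May 2019'. A minimal base R sketch follows; `clean_category` and `parse_news_date` are hypothetical helper names, and the date helper assumes the "day month-name year" layout with English month names seen in the items above.

```r
# Hypothetical helper: ' <span>General</span>  <span>Studies</span> '
# becomes "General; Studies".
clean_category <- function(x) {
  x <- gsub("</span>\\s*<span>", "; ", x)  # separator between adjacent spans
  x <- gsub("<[^>]+>", "", x)              # drop the remaining tags
  trimws(x)
}

# Hypothetical helper: parse '8 May 2019' into a Date.
# Uses month.name (always English) instead of the session locale.
parse_news_date <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  day   <- as.integer(vapply(parts, `[`, character(1), 1))
  month <- match(vapply(parts, `[`, character(1), 2), month.name)
  year  <- as.integer(vapply(parts, `[`, character(1), 3))
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

categories <- clean_category(
  c(" <span>International</span> ",
    " <span>General</span>  <span>Student Life</span>  <span>Alumni</span> ")
)
dates <- parse_news_date(c("8 May 2019", "4 May 2019"))
```

Titles can also contain HTML entities such as `&#8211;`; decoding those is not covered by this sketch.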

My second problem is that only the first elements are loaded, because you have to click on "load more". Is there a solution for that as well?
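
One way to check whether the click is needed at all: count the items in the inline `var beitraege = [...]` script from the page source and compare that with the number of headlines the rendered page shows. If the script already holds more items than are displayed, the button presumably only reveals data that is already in the HTML; that is an assumption about this site, not something confirmed here. `count_items` is a hypothetical helper and the sample text is a shortened stand-in for the real script.

```r
# Hypothetical helper: number of  ID:'12345'  entries in the script text.
count_items <- function(js_text) {
  pattern <- "\\bID:\\s*'[0-9]+'"
  lengths(regmatches(js_text, gregexpr(pattern, js_text, perl = TRUE)))
}

# Shortened stand-in for the inline script text.
sample_js <- "var beitraege = [ { ID:'59888', date:'8 May 2019' }, { ID:'59594', date:'4 May 2019' }, { ID:'59614', date:'3 May 2019' } ];"

n_items <- count_items(sample_js)
```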

Thanks a lot in advance!
