plugins – Need help creating asynchronous data scraper in WordPress

Question

I need some help with a data-scraping script I’ve been working on. I want the script to extract content from a page.

I fetch the page with wp_remote_get() and wp_remote_retrieve_body(). Using the DOMDocument class, I was able to target specific elements and then upload them to the database. Please take a look at the following code:

    $url = "https://www.myexampleurl.com/page-to-be-scraped";

    // Fetch the page, allowing up to 60 seconds for the response
    $data = wp_remote_get(
        $url,
        array(
            'timeout' => 60,
        )
    );
    $body = wp_remote_retrieve_body( $data );

    $dom = new DOMDocument();
    $dom->loadHTML( $body );

    $xpath = new DOMXPath( $dom );
    $xpath->registerNamespace( 'm', $url );

    // The page has a center tag, believe it or not
    $center = $dom->getElementsByTagName( 'center' )->item( 0 );

    // Targeting all links inside the center tag
    // ('.//a' is relative to $center; '//a' would match every link in the document)
    $query   = './/a';
    $entries = $xpath->query( $query, $center );

    foreach ( $entries as $entry ) {

        // The target elements have a 'data-lightbox' attribute
        $attr = $entry->attributes->getNamedItem( 'data-lightbox' );

        // Uploading the sibling attribute of 'data-lightbox'
        if ( ! empty( $attr ) ) {

            // The fetched data is uploaded to the database using this function
            my_upload_file_by_url( $attr->previousSibling->nodeValue );

        }
    }

Now, what I need to know is how to make this request asynchronous. I tried AJAX, but the request times out and throws an error.
Also, $dom->loadHTML( $body ) throws the following warning:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 241 in /Users/apple/Sites/indidev/wp-content/plugins/crawler/crawler.php on line 56
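
For context, this kind of warning usually points to a bare "&" somewhere in the fetched markup (for example in a query string inside an href). Below is a minimal sketch, not a definitive fix, of collecting libxml's parse messages instead of letting them surface as PHP warnings. The sample $body string is a made-up stand-in for the fetched page, and the exact messages collected depend on the installed libxml2 version.

```php
<?php
// Hypothetical sample markup (an assumption, not the real page). The bare "&"
// characters are the sort of input libxml2 typically reports as htmlParseEntityRef.
$body = '<html><body><center>'
      . '<a href="/img.jpg?a=1&b=2" data-lightbox="x">Tom & Jerry</a>'
      . '</center></body></html>';

// Collect libxml parse errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors( true );

$dom = new DOMDocument();
$dom->loadHTML( $body );

// Read whatever the parser complained about, then empty the error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode for the rest of the request.
libxml_use_internal_errors( $previous );

// The document is still usable even though the input was not well-formed.
$links = $dom->getElementsByTagName( 'a' );
echo $links->length . " link(s) parsed, " . count( $errors ) . " parser message(s) collected\n";
```

Each entry in $errors is a LibXMLError object, so its message and line properties can be logged if the malformed markup needs to be tracked down.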

I also tried the wp_schedule_single_event() function, but it doesn’t work as expected either. Any pointers are appreciated!

PS: There are many pages that need to be scraped and inserted into the database at the same time.

Divjot Singh, 2022-08-19T15:27:04-05:00