crawl-urls
This command crawls the ameblo URLs given as arguments. It is useful for crawling specific URLs for debugging or operational purposes.
First, a client (Cloud Scheduler or any other scheduler) calls CrawlRssFeedsTask without any parameters. This task identifies the RSS feed URLs, which have the format http://rssblog.ameba.jp/{key}/rss20.xml, from the HPAsset table and fetches the RSS XML documents from those URLs. It then extracts blog post URLs from the RSS XML and narrows down the URLs to crawl by comparing them against the HPAmebloPost entries.
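The feed lookup and URL extraction described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the sample feed, and the sample post URLs are hypothetical, and the comparison rule (skip any URL that is already stored) is an assumption, since the text does not say exactly how URLs are compared against HPAmebloPost.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Feed URL format taken from the description above.
FEED_URL_FORMAT = "http://rssblog.ameba.jp/{key}/rss20.xml"


def feed_url(key: str) -> str:
    """Build the RSS feed URL for a blog key (hypothetical helper)."""
    return FEED_URL_FORMAT.format(key=key)


def extract_post_urls(rss_xml: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())
    return urls


def urls_to_crawl(feed_urls: list[str], known_urls: set[str]) -> list[str]:
    """Drop duplicates and URLs that are already stored.

    Assumption: "already stored" means the URL exists as an HPAmebloPost
    entry. The real task may apply a different rule.
    """
    seen = set(known_urls)
    result = []
    for url in feed_urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical sample feed; a real feed would be fetched over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>example</title>
    <item><link>https://ameblo.jp/example-key/entry-1.html</link></item>
    <item><link>https://ameblo.jp/example-key/entry-2.html</link></item>
  </channel>
</rss>"""

if __name__ == "__main__":
    found = extract_post_urls(SAMPLE_RSS)
    known = {"https://ameblo.jp/example-key/entry-1.html"}
    print(feed_url("example-key"))
    print(urls_to_crawl(found, known))
```

Keeping parsing and filtering separate from fetching means both steps can be exercised against a saved feed without network access.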
Once the task has identified the list of URLs that need to be crawled, it splits the list into chunks and calls CrawlUrlsTask for each chunk.
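As a rough sketch of the chunking step, assuming a fixed chunk size and a caller-supplied function that stands in for the actual call to CrawlUrlsTask (both are hypothetical; the text specifies neither the chunk size nor the call mechanism):

```python
from __future__ import annotations

from typing import Callable, Iterator, Sequence


def chunked(urls: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `urls` holding at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def dispatch_chunks(
    urls: Sequence[str],
    chunk_size: int,
    call_crawl_urls_task: Callable[[list[str]], None],
) -> int:
    """Call the given function once per chunk and return the number of calls.

    `call_crawl_urls_task` is a placeholder for however CrawlUrlsTask is
    actually invoked (for example, by enqueueing a task).
    """
    calls = 0
    for chunk in chunked(urls, chunk_size):
        call_crawl_urls_task(chunk)
        calls += 1
    return calls


if __name__ == "__main__":
    dispatched: list[list[str]] = []
    count = dispatch_chunks(["u1", "u2", "u3", "u4", "u5"], 2, dispatched.append)
    print(count, dispatched)
```

Splitting the work this way bounds the size of each CrawlUrlsTask call, so one slow or failing chunk does not hold up the rest.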
CrawlUrlsTask simply crawls each URL passed as an argument. It is expected to be called from CrawlRssFeedsTask or the crawl-urls command.
This is an operational task to crawl specific posts stored as HPAmebloPost ents. For example, if you change the crawling logic and want to update the fields of older posts, you can set the recrawl_required flag to TRUE on those ents and then call this task to crawl them.
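The flag-driven recrawl can be illustrated with the in-memory sketch below. The `Post` class is a hypothetical stand-in for an HPAmebloPost ent, and `crawl` is a placeholder for the real crawling logic. Resetting the flag after a successful crawl is an assumption; the text does not say whether the task clears it.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Post:
    """Hypothetical stand-in for an HPAmebloPost ent."""
    url: str
    recrawl_required: bool = False


def recrawl_flagged(posts: Iterable[Post], crawl: Callable[[str], None]) -> list[str]:
    """Crawl every post whose recrawl_required flag is set.

    Returns the crawled URLs in order. The flag is reset afterwards
    (an assumption) so that a second run does not crawl the same post again.
    """
    crawled = []
    for post in posts:
        if not post.recrawl_required:
            continue
        crawl(post.url)
        post.recrawl_required = False
        crawled.append(post.url)
    return crawled


if __name__ == "__main__":
    posts = [
        Post("https://ameblo.jp/example-key/entry-1.html", recrawl_required=True),
        Post("https://ameblo.jp/example-key/entry-2.html"),
    ]
    print(recrawl_flagged(posts, crawl=lambda url: None))
```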