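Bulk-fetching book information with the NDL Search API (Python)

The National Diet Library's search service (NDL Search) can apparently be used to retrieve a wide range of information about books.

The search covers not only the National Diet Library's own holdings but also the holdings of research institutions and libraries across Japan, plus digital data, so services such as Aozora Bunko also appear to be searchable.

The following six protocols are available for searching.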

SRU
SRW
OpenSearch
OpenURL
Z39.50
OAI-PMH

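For an overview, see
About NDL Search
For technical details, see
External Interfaces (API)

- Contents -

What I wanted to do
The 500-record wall
OpenSearch
Main search keys
XML
Source
print output
Some names and titles cannot be encoded in Shift-JIS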

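What I wanted to do

I tried it out right away.
The goal was to take the books matching a keyword, pull out the title, author, publisher, and publication year for each, and save them to a CSV file. Unfortunately, NDL Search has a restriction, and it seems that only 500 records can be retrieved.

The 500-record wall

The 500-record limit means that if you search for, say, "books on artificial intelligence published since 2000" and 1,000 books match, you can still retrieve information for only 500 of them.
Quite disappointing.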

OpenSearch

I have only skimmed the specification, but of the six protocols OpenSearch looks the simplest, so I implemented the search with it.

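Main search keys

These are some of the search keys.

dpid
Data provider ID (an ID assigned to each of the institutions that supply data).
Specifying it restricts the search to that provider's data.
For example, to target Aozora Bunko, use "dpid=aozora".

title
Book title. Matching is apparently not limited to the title itself; series names, subdivisions, and similar fields are also checked.

creator
Author

from
Start of the publication date range

until
End of the publication date range

mediatype

1: Books
2: Articles and papers
3: Newspapers
4: Children's books
5: Reference information
6: Digital materials
7: Other
8: Materials for people with disabilities (materials covered by the search for such materials)
9: Legislative information

If mediatype is omitted, all materials are searched, so large numbers of weekly-magazine articles and papers come back.

cnt
Maximum number of records to return

idx
Position of the first record to return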
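
As a rough illustration of how these keys combine into one request, the sketch below builds an OpenSearch query URL using only the standard library. The endpoint is the one used in the script later in this post; the helper name `build_query_url` and the sample values are made up for this example.

```python
from urllib.parse import urlencode

# Endpoint taken from the script later in this post.
ENDPOINT = 'http://iss.ndl.go.jp/api/opensearch'


def build_query_url(keys):
    """Return the request URL for a dict of search keys (hypothetical helper)."""
    # Keys whose value is None are left out of the request entirely.
    query = {k: str(v) for k, v in keys.items() if v is not None}
    return ENDPOINT + '?' + urlencode(query)


url = build_query_url({
    'title': '人工知能',      # matched against titles, series names, etc.
    'mediatype': 1,           # 1: books only
    'from': '2000-01-01',     # start of the publication date range
    'cnt': 200,               # maximum number of records per response
    'idx': 1,                 # position of the first record
    'dpid': None,             # no provider restriction
})
print(url)
```

A dict is used rather than keyword arguments because `from` is a reserved word in Python and cannot be passed as a keyword.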

XML

The data returned by the server is XML.
Each item tag corresponds to one book, and tags such as title and author are lined up beneath it.

root
 ┗ channel
      ┗ item
           ┗ title
           ┗ author
           ┗  ･･･
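
To make the tree concrete, here is a minimal sketch that walks a hand-written sample with `xml.etree.ElementTree`. The sample document is not a real API response: the `rss` root element and all field values are assumptions for illustration, while the namespace URIs are the ones the script later in this post uses.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like the tree above (not real API output).
SAMPLE = '''<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:dcterms="http://purl.org/dc/terms/">
  <channel>
    <description>sample</description>
    <item>
      <title>Sample Book</title>
      <author>Sample Author</author>
      <dc:publisher>Sample Press</dc:publisher>
      <dcterms:issued>2016</dcterms:issued>
    </item>
    <item>
      <title>Another Book</title>
    </item>
  </channel>
</rss>'''

# Prefix-to-URI map so that paths can be written as 'dc:publisher'.
NS = {
    'dc': 'http://purl.org/dc/elements/1.1/',
    'dcterms': 'http://purl.org/dc/terms/',
}


def text_of(item, path):
    """Return the text of the first matching child, or '' if it is absent."""
    return item.findtext(path, default='', namespaces=NS) or ''


root = ET.fromstring(SAMPLE)
rows = [
    {
        'title': text_of(item, 'title'),
        'author': text_of(item, 'author'),
        'publisher': text_of(item, 'dc:publisher'),
        'issued': text_of(item, 'dcterms:issued'),
    }
    for item in root.findall('./channel/item')
]
print(rows)
```

Missing tags come back as empty strings, so every row has the same keys even when a record lacks an author or publisher.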

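Source

The script repeats the search while advancing idx so that it can fetch a large number of records, but because of the server-side restriction it actually gets at most 500.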

import xml.etree.ElementTree as ET
from collections import defaultdict

import requests
from pandas import DataFrame


# Search conditions
params = {}
params['title']     = '人工知能'   # "artificial intelligence"
params['mediatype'] = '1'
params['from']      = '1980-01-01'
params['cnt']       = '200'
params['idx']       = '1'

list_map = defaultdict(list)
total = 0


def record(key, element):
    """Print and store the element's text; store '' when it is missing."""
    text = element.text if element is not None and element.text else ''
    if text:
        print(' ' + text)
    list_map[key].append(text)


# Session
s = requests.session()

while True:

    # Search request
    r = s.get('http://iss.ndl.go.jp/api/opensearch', params=params)
    r.raise_for_status()

    # Parse the XML from the raw response bytes
    root = ET.fromstring(r.content)
    print('---------------------------------------------')
    print(root.find('channel').find('description').text)
    print('---------------------------------------------')

    items = root.findall('.//item')
    for i, item in enumerate(items):
        print('--------' + str(total+i+1) + '---------')

        # Title
        title = item.find('title').text
        print(title)
        list_map['title'].append(title)

        # ID
        #   Extracted from the text of the link tag
        #     Example
        #     <link>http://iss.ndl.go.jp/books/R100000001-I022140205-00</link>
        #                                              ↓
        #                                      R100000001-I022140205-00
        #
        link = item.find('link').text
        book_id = link[link.rfind('/')+1:]
        print(' ' + book_id)
        list_map['ID'].append(book_id)

        # Author
        #   The format varies
        #     Examples
        #      ・ 夏目 漱石,
        #      ・ 夏目漱石 作,
        #      ・ 夏目, 漱石, 1867-1916,
        #      ・ 夏目漱石／著,
        #
        record('author', item.find('author'))

        # Publication date
        #     Example: Fri, 23 Jun 1995 09:00:00 +0900
        record('pubDate', item.find('pubDate'))

        # Year of issue
        #   Several values may be set, so take the earliest one
        #   (string comparison, e.g. '1982' < '1996-03')
        issueds = item.findall('{http://purl.org/dc/terms/}issued')
        lst = [issued.text for issued in issueds if issued.text]
        earliest = min(lst) if lst else ''
        if earliest:
            print(' ' + earliest)
        list_map['issued'].append(earliest)

        # Series title
        #    For paperbacks, the name of the paperback series appears here
        record('seriesTitle', item.find('{http://ndl.go.jp/dcndl/terms/}seriesTitle'))

        # Publisher
        record('publisher', item.find('{http://purl.org/dc/elements/1.1/}publisher'))

    # Stop when the server returns fewer records than requested
    cnt = int(params['cnt'])
    idx = int(params['idx'])
    if len(items) < cnt:
        break

    # Otherwise move the start position forward and search again
    params['idx'] = str(idx + cnt)
    total += cnt

columns = ['title', 'ID', 'author', 'pubDate', 'issued', 'seriesTitle', 'publisher']
df = DataFrame(list_map, columns=columns)

df.to_csv("books.csv", encoding='utf-8')
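
The paging arithmetic in the script can be checked separately from the network code. The sketch below lists which (idx, cnt) pairs would be needed to cover the first 500 records. The function name `plan_requests` is hypothetical, and treating the cap as "records 1 through 500" is an assumption based on the behavior observed in this post, not on the specification.

```python
def plan_requests(cnt, limit=500):
    """Return (idx, size) pairs covering records 1..limit without overshooting.

    `limit` models the observed 500-record cap (an assumption, see above).
    """
    plan = []
    idx = 1  # idx is 1-based, as in the search parameters
    while idx <= limit:
        # The last request is shrunk so it never asks past the cap.
        size = min(cnt, limit - idx + 1)
        plan.append((idx, size))
        idx += size
    return plan


print(plan_requests(200))  # [(1, 200), (201, 200), (401, 100)]
```

With cnt=200, three requests are enough, and the third one yields only 100 records. That matches the script above stopping once a response contains fewer items than cnt.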

print output

This is what print shows on screen, for the last five books.

--------496---------
知的教育システム研究会
 R100000039-I001134254-00
 人工知能学会,
 1996-03
 人工知能学会研究会資料
 人工知能学会
--------497---------
知恵のわ : 大人の教養読本
 R100000001-I070368770-00
 JBCCホールディングスLink編集室‖編著,
 2016
 日経BPコンサルティング
--------498---------
岩波講座情報科学
 R100000002-I000001555844-00
 Thu, 18 Dec 2003 09:00:00 +0900
 1982
 岩波書店
--------499---------
脳・心・人工知能 : 数理で脳を解き明かす
 R100000002-I027270968-00
 甘利俊一 著,
 Fri, 15 Jul 2016 09:00:00 +0900
 2016
 ブルーバックス ; B-1968
 講談社
--------500---------
人工知能教科書 : 主要分野をコンパクトに解説
 R100000002-I023548954-00
 赤間世紀 著,I O編集部 編集,
 Thu, 12 Jul 2012 09:00:00 +0900
 2012
 I/O BOOKS
 工学社
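
The pubDate values in this output use the RFC 822 style date format common in RSS feeds, which is awkward to sort or filter in a spreadsheet. As a small sketch (the helper name `pubdate_to_iso` is made up here), the standard library can convert such strings to plain dates:

```python
from email.utils import parsedate_to_datetime


def pubdate_to_iso(pub_date):
    """Convert an RFC 822 style date string to 'YYYY-MM-DD', or '' if empty."""
    if not pub_date:
        return ''
    return parsedate_to_datetime(pub_date).date().isoformat()


print(pubdate_to_iso('Fri, 15 Jul 2016 09:00:00 +0900'))  # 2016-07-15
```

Note that pubDate does not always agree with the year of issue: record 498 above has a pubDate in 2003 but an issued value of 1982, so the two columns are worth keeping separate.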

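Some names and titles cannot be encoded in Shift-JIS

I planned to write the CSV file in Shift-JIS because I wanted to open it in Excel later, but that raised an error. Quite a few kanji in personal names and book titles have no Shift-JIS encoding. For example:

李世乭

He is the Go champion who played against AlphaGo.

I gave up on Shift-JIS and saved the file as utf-8.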
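
The failure can be reproduced without pandas. The sketch below checks whether the name above can be encoded with Python's `shift_jis` codec. It also shows one alternative that was not tried in this post: the `utf-8-sig` codec, which prepends a byte order mark that Excel generally uses to recognize a CSV file as UTF-8.

```python
import codecs

name = '李世乭'

# Try the encoding that was originally intended for the CSV file.
try:
    name.encode('shift_jis')
    encodable = True
except UnicodeEncodeError:
    encodable = False
print('Shift-JIS encodable:', encodable)

# UTF-8 with a byte order mark: the text survives and the file is tagged.
data = ('title,author\n' + 'sample,' + name + '\n').encode('utf-8-sig')
print('starts with BOM:', data.startswith(codecs.BOM_UTF8))
print('round trip ok:', name in data.decode('utf-8-sig'))
```

If this approach suits your setup, the same codec name can be passed to pandas, as in `df.to_csv("books.csv", encoding='utf-8-sig')`. Whether Excel then opens the file cleanly depends on the Excel version, so treat this as something to verify rather than a guaranteed fix.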
