
Building Your Own Dataset

(1) Learn to use a crawler to scrape images and videos, and to extract frames from videos.

(2) Organize the collected images: rename them, unify the file format, and remove duplicates.

Crawling Images

Some tasks have no directly matching open-source dataset, or the open-source datasets are too small, so we need to crawl images ourselves through search engines.

Baidu image crawler

Download images from Google, Bing, Baidu.

Google, Naver multiprocess image web crawler (Selenium)
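
These crawlers handle the search-engine side. Once you have a list of image URLs, the download step itself is short; below is a minimal sketch using requests, where the urls.txt file and the downloads/ folder are illustrative assumptions rather than part of any of the tools above.

# Minimal sketch (assumption, not the API of the crawlers above):
# download every URL listed in urls.txt into downloads/.
import os
import requests

os.makedirs('downloads', exist_ok=True)

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for idx, url in enumerate(urls):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        # Save with a sequential name; format unification is done later anyway.
        with open(os.path.join('downloads', '%06d.jpg' % idx), 'wb') as out:
            out.write(resp.content)
    except requests.RequestException as e:
        print('skip', url, e)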

Organizing the Dataset

If you crawled videos, convert them into images first; if you crawled images, unify the format and clean the data.

Converting Videos to Images

If the crawled data is video, run python getimagefromvideo.py <video_path> to convert the video into images.

# coding: utf-8
import cv2
import sys
import os

# Open the video passed on the command line.
video_capture = cv2.VideoCapture(sys.argv[1])

# Use the video file name (without extension) as the output directory.
video_id = os.path.splitext(sys.argv[1])[0]
os.mkdir(video_id)

count = 0
while True:
    is_successfully_read, im = video_capture.read()
    if not is_successfully_read:
        break
    # Save every frame as <video_id>/<frame_index>.jpg
    cv2.imwrite(os.path.join(video_id, str(count) + '.jpg'), im)
    print("image shape=", im.shape)
    count += 1
print(count)

Unifying the Image File Format

A unified file suffix reduces the effort of writing the data-loading API later, and it also verifies that every image can actually be read, catching unknown problems early. This matters.

Run python reformat_image.py <images_folder_path> to convert all images to jpg, a format every framework supports.

import os
import sys
import cv2

def listfiles(rootDir):
    # Walk the directory tree and re-encode every readable image as .jpg.
    list_dirs = os.walk(rootDir)
    for root, dirs, files in list_dirs:
        for d in dirs:
            print(os.path.join(root, d))
        for f in files:
            fileid = f.split('.')[0]
            filepath = os.path.join(root, f)
            try:
                src = cv2.imread(filepath, 1)
                print("src=", filepath, src.shape)
                # Delete the original file and re-save it as .jpg.
                os.remove(filepath)
                cv2.imwrite(os.path.join(root, fileid + ".jpg"), src)
            except Exception:
                # cv2.imread returns None for unreadable files, so src.shape
                # raises here; such files are simply deleted.
                os.remove(filepath)
                continue

listfiles(sys.argv[1])

Renaming Images to a Consistent Pattern

Naming files in a consistent format makes the data easier to distinguish and organize.

mkdir tmp
./rename_files_function.sh <images_folder_path> ./tmp/ <label>
#!/bin/bash

i=0
dir=$1        # source image folder
resultdir=$2  # temporary folder the renamed files are moved into
app=$3        # label prefix added to every file name

for file in $dir""*
do
    # Keep only the last path component as the file name
    arr=$(echo $file | tr "/" "\n")
    for x in $arr
    do
        filename=$x
    done

    # Split the file name on '.' to get the base name and the extension
    brr=$(echo $filename | tr "." "\n")
    brrs=( $brr )
    fileid=${brrs[0]}

    num=${#brrs[@]}
    index=$(expr $num - 1)
    fileformat=${brrs[index]}

    echo file=""$file
    echo fileid=""$fileid
    echo fileformat=""$fileformat

    # Only rename common image formats; everything else is reported
    if [ $fileformat == jpeg -o $fileformat == png -o $fileformat == jpg -o $fileformat == bmp ]
    then
        i=$(expr $i + 1)
        resultfile=$resultdir""$app""$i"".$fileformat
        echo file=""$file"",resultfile=""$resultfile
        mv "$file" "$resultfile"
    else
        echo $file""not good
    fi
done

echo "remove "$dir"*"
#rm $dir""*
echo "move "$resultdir"* back"
mv $resultdir""* $dir
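
The shell script above detours through a temporary folder to avoid name collisions. If you prefer Python, here is a minimal sketch with the same effect (rename every image in a folder to <label><index>.<extension>); the script name and its arguments are hypothetical, not part of the original tooling.

# Minimal sketch (hypothetical helper, not from the original post):
# rename all images under a folder to <label><index>.<ext>.
import os
import sys

def rename_images(folder, label):
    exts = {'.jpg', '.jpeg', '.png', '.bmp'}
    i = 0
    for name in sorted(os.listdir(folder)):
        base, ext = os.path.splitext(name)
        if ext.lower() not in exts:
            print(name, 'not good')
            continue
        i += 1
        # Renaming in place can collide with files that already follow the
        # <label><index> pattern, which is why the shell version uses ./tmp/.
        os.rename(os.path.join(folder, name),
                  os.path.join(folder, '%s%d%s' % (label, i, ext)))

if __name__ == '__main__':
    # usage: python rename_images.py <images_folder_path> <label>
    rename_images(sys.argv[1], sys.argv[2])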

Deduplication

If you use multiple keywords, run the same keyword through different search engines, or extract frames from videos, many of the crawled images are likely to be duplicates or near-duplicates, and such samples should be removed.

There are many ways to do this, for example checking whether two images are exactly identical, or measuring their similarity with hashing and other similarity metrics. Here we provide one method that removes duplicates based on a similarity score.

# sudo pip install python-Levenshtein
conda install -c conda-forge python-levenshtein
python remove_repeat.py <image_path>

https://anaconda.org/conda-forge/python-levenshtein
#!/usr/bin/env python
# coding: utf-8
import math

from PIL import Image
import Levenshtein


class BWImageCompare(object):
    """Compares two images (b/w)."""

    _pixel = 255
    _colour = False

    def __init__(self, imga, imgb, maxsize=64):
        """Save a copy of the image objects."""

        sizea, sizeb = imga.size, imgb.size

        newx = min(sizea[0], sizeb[0], maxsize)
        newy = min(sizea[1], sizeb[1], maxsize)

        # Rescale to a common size:
        imga = imga.resize((newx, newy), Image.BICUBIC)
        imgb = imgb.resize((newx, newy), Image.BICUBIC)

        if not self._colour:
            # Store the images in B/W Int format
            imga = imga.convert('I')
            imgb = imgb.convert('I')

        self._imga = imga
        self._imgb = imgb

        # Store the common image size
        self.x, self.y = newx, newy

    def _img_int(self, img):
        """Convert an image to a list of pixels."""

        x, y = img.size

        for i in range(x):
            for j in range(y):
                yield img.getpixel((i, j))

    @property
    def imga_int(self):
        """Return a tuple representing the first image."""

        if not hasattr(self, '_imga_int'):
            self._imga_int = tuple(self._img_int(self._imga))

        return self._imga_int

    @property
    def imgb_int(self):
        """Return a tuple representing the second image."""

        if not hasattr(self, '_imgb_int'):
            self._imgb_int = tuple(self._img_int(self._imgb))

        return self._imgb_int

    @property
    def mse(self):
        """Return the mean square error between the two images."""

        if not hasattr(self, '_mse'):
            tmp = sum((a - b) ** 2 for a, b in zip(self.imga_int, self.imgb_int))
            self._mse = float(tmp) / self.x / self.y

        return self._mse

    @property
    def psnr(self):
        """Calculate the peak signal-to-noise ratio."""

        if not hasattr(self, '_psnr'):
            self._psnr = 20 * math.log(self._pixel / math.sqrt(self.mse), 10)

        return self._psnr

    @property
    def nrmsd(self):
        """Calculate the normalized root mean square deviation."""

        if not hasattr(self, '_nrmsd'):
            self._nrmsd = math.sqrt(self.mse) / self._pixel

        return self._nrmsd

    @property
    def levenshtein(self):
        """Calculate the Levenshtein distance."""

        if not hasattr(self, '_lv'):
            stra = ''.join(chr(x) for x in self.imga_int)
            strb = ''.join(chr(x) for x in self.imgb_int)

            lv = Levenshtein.distance(stra, strb)

            self._lv = float(lv) / self.x / self.y

        return self._lv


class ImageCompare(BWImageCompare):
    """Compares two images (colour)."""

    _pixel = 255 ** 3
    _colour = True

    def _img_int(self, img):
        """Convert an image to a list of pixels."""

        x, y = img.size

        for i in range(x):
            for j in range(y):
                pixel = img.getpixel((i, j))
                yield pixel[0] | (pixel[1] << 8) | (pixel[2] << 16)

    @property
    def levenshtein(self):
        """Calculate the Levenshtein distance (averaged over R, G, B)."""

        if not hasattr(self, '_lv'):
            stra_r = ''.join(chr(x >> 16) for x in self.imga_int)
            strb_r = ''.join(chr(x >> 16) for x in self.imgb_int)
            lv_r = Levenshtein.distance(stra_r, strb_r)

            stra_g = ''.join(chr((x >> 8) & 0xff) for x in self.imga_int)
            strb_g = ''.join(chr((x >> 8) & 0xff) for x in self.imgb_int)
            lv_g = Levenshtein.distance(stra_g, strb_g)

            stra_b = ''.join(chr(x & 0xff) for x in self.imga_int)
            strb_b = ''.join(chr(x & 0xff) for x in self.imgb_int)
            lv_b = Levenshtein.distance(stra_b, strb_b)

            self._lv = (lv_r + lv_g + lv_b) / 3. / self.x / self.y

        return self._lv


class FuzzyImageCompare(object):
    """Compares two images based on the previous comparison values."""

    def __init__(self, imga, imgb, lb=1, tol=15):
        """Store the images in the instance."""

        self._imga, self._imgb, self._lb, self._tol = imga, imgb, lb, tol

    def compare(self):
        """Run all the comparisons."""

        if hasattr(self, '_compare'):
            return self._compare

        lb, i = self._lb, 2

        diffs = {
            'levenshtein': [],
            'nrmsd': [],
            'psnr': [],
        }

        stop = {
            'levenshtein': False,
            'nrmsd': False,
            'psnr': False,
        }

        while not all(stop.values()):
            cmp = ImageCompare(self._imga, self._imgb, i)

            diff = diffs['levenshtein']
            if len(diff) >= lb + 2 and \
               abs(diff[-1] - diff[-lb-1]) <= abs(diff[-lb-1] - diff[-lb-2]):
                stop['levenshtein'] = True
            else:
                diff.append(cmp.levenshtein)

            diff = diffs['nrmsd']
            if len(diff) >= lb + 2 and \
               abs(diff[-1] - diff[-lb-1]) <= abs(diff[-lb-1] - diff[-lb-2]):
                stop['nrmsd'] = True
            else:
                diff.append(cmp.nrmsd)

            diff = diffs['psnr']
            if len(diff) >= lb + 2 and \
               abs(diff[-1] - diff[-lb-1]) <= abs(diff[-lb-1] - diff[-lb-2]):
                stop['psnr'] = True
            else:
                try:
                    diff.append(cmp.psnr)
                except ZeroDivisionError:
                    diff.append(-1)  # to indicate that the images are identical

            i *= 2

        self._compare = {
            'levenshtein': 100 - diffs['levenshtein'][-1] * 100,
            'nrmsd': 100 - diffs['nrmsd'][-1] * 100,
            'psnr': diffs['psnr'][-1] == -1 and 100.0 or diffs['psnr'][-1],
        }

        return self._compare

    def similarity(self):
        """Try to calculate the image similarity."""

        cmp = self.compare()

        lnrmsd = (cmp['levenshtein'] + cmp['nrmsd']) / 2
        return lnrmsd
        # return min(lnrmsd * cmp['psnr'] / self._tol, 100.0)  # TODO: fix psnr!


if __name__ == '__main__':

    import sys
    import os

    srcimages = os.listdir(sys.argv[1])
    srcimages.sort()

    tot = len(srcimages)
    tot = (tot ** 2 - tot) // 2

    print('Comparing %d images:' % tot)

    # Only used by the commented-out all-pairs comparison at the bottom.
    images = {}

    # Walk forward through the sorted list, deleting the next image whenever
    # it is too similar to the current one.
    similarity_thresh = 0.5  # similarity threshold: above it, treat the pair as the same image
    i = 0
    while i < len(srcimages) - 1:
        print("i=", i, "num of srcimages", len(srcimages))

        imga = Image.open(os.path.join(sys.argv[1], srcimages[i]))
        imgb = Image.open(os.path.join(sys.argv[1], srcimages[i+1]))
        cmp = FuzzyImageCompare(imga, imgb)
        sim = cmp.similarity() / 100
        print("image", os.path.join(sys.argv[1], srcimages[i]),
              "and image", os.path.join(sys.argv[1], srcimages[i+1]), "sim=", sim)
        if sim > similarity_thresh:
            print("delete", os.path.join(sys.argv[1], srcimages[i+1]))
            os.remove(os.path.join(sys.argv[1], srcimages[i+1]))
            srcimages.pop(i+1)
        else:
            i = i + 1

    # Original all-pairs comparison, kept here commented out:
    '''
    results, i = {}, 1
    for namea, imga in images.items():
        for nameb, imgb in images.items():
            if namea == nameb or (nameb, namea) in results:
                continue
            print ' * %2d / %2d:' % (i, tot),
            print namea, nameb, '...',
            cmp = FuzzyImageCompare(imga, imgb)
            sim = cmp.similarity()
            results[(namea, nameb)] = sim
            print '%.2f %%' % sim
            i += 1
    res = max(results.values())
    imgs = [k for k, v in results.iteritems() if v == res][0]
    print 'Most similar images: %s %s (%.2f %%)' % (imgs[0], imgs[1], res)
    '''
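
If the pairwise Levenshtein comparison above is too slow for a large folder, the hash-based similarity mentioned earlier is a lighter alternative. The sketch below uses a simple average hash built on Pillow only; the 8x8 hash size and the Hamming-distance threshold of 5 are assumptions to tune on your own data.

# Minimal sketch of hash-based deduplication (average hash); the threshold
# and hash size are assumptions, not values from the original post.
import os
import sys
from PIL import Image

def ahash(path, size=8):
    # 8x8 grayscale average hash packed into a 64-bit integer.
    img = Image.open(path).convert('L').resize((size, size), Image.BICUBIC)
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > avg else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count('1')

if __name__ == '__main__':
    folder = sys.argv[1]
    kept = []  # (hash, filename) of the images we decided to keep
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        try:
            h = ahash(path)
        except OSError:
            continue  # not an image Pillow can read
        if any(hamming(h, k) <= 5 for k, _ in kept):
            print('delete', path)
            os.remove(path)
        else:
            kept.append((h, name))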

After this you still need to filter the images by hand, which is a fair amount of work in itself, but deduplication does cut it down considerably.

Annotating the Dataset

Crawled images have to be annotated by yourself; the annotation tools below can help.

https://github.com/tzutalin/labelImg

LabelImg is a graphical image annotation tool for labeling object bounding boxes in images. https://youtu.be/p0nR2YsCY_U

https://github.com/wkentaro/labelme

Image Polygonal Annotation with Python (polygon, rectangle, circle, line, point and image-level flag annotation).

https://github.com/Microsoft/VoTT

Visual Object Tagging Tool: An electron app for building end to end Object Detection Models from Images and Videos.

Splitting the Dataset

The dataset is usually split 8:1:1 into training, validation, and test sets. Write a shell script to suit your own setup; below is the script I used to split data when training a yolov3 model with darknet.

#!/bin/sh

if [ $# != 1 ]; then
    echo "Usage: $0 <full path>"
    exit -1
fi

path=$1

for sub_dir in `ls $path`
do
    # Full path of the sub-directory
    sub_dir_path=$path/$sub_dir
    if [ -d $sub_dir_path ]
    then
        # Move all files in the sub-directory up into the parent directory
        `mv $sub_dir_path/* $path`
        # Remove the now-empty sub-directory
        `rm -rf $sub_dir_path`
    fi

    # Add a prefix to all files

done

`rm tmp.txt`
# Write the paths of all files of the given image types into tmp.txt
# ***** Known issue: a blank line is left at the end *****
# Only images that have a matching .txt label file are appended to tmp.txt
for image in `find $path | grep -E 'jpg|png|JPEG|JPG|PNG'`
do

    txt=${image%.*}".txt"

    if [ -f $txt ]
    then
        echo ${image}
        `echo ${image} >> tmp.txt`
    fi

done

# Split the paths 8:1:1 into train.txt, val.txt and test.txt
# 1. Count the number of lines in tmp.txt
# 2. Work out the line numbers assigned to each split
# 3. Write the corresponding lines into each file
line=`cat tmp.txt | wc -l`
line1=$(($line/10*8))
line2=$(($line/10*8+line/10+1))

`sed -n 1,${line1}p tmp.txt >> train.txt`
`sed -n $((${line1}+1)),${line2}p tmp.txt >> val.txt`
`sed -n $((${line2}+1)),$((${line}-1))p tmp.txt >> test.txt`
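
If shell is not your thing, the same 8:1:1 split can be sketched in a few lines of Python. This version reads the same tmp.txt list of image paths; the random shuffle is an extra step the shell script above does not perform.

# Minimal sketch: split the paths in tmp.txt 8:1:1 into
# train.txt / val.txt / test.txt (the shuffle is an addition).
import random

with open('tmp.txt') as f:
    paths = [line.strip() for line in f if line.strip()]

random.shuffle(paths)

n = len(paths)
n_train = int(n * 0.8)
n_val = int(n * 0.1)

splits = {
    'train.txt': paths[:n_train],
    'val.txt': paths[n_train:n_train + n_val],
    'test.txt': paths[n_train + n_val:],
}

for name, subset in splits.items():
    with open(name, 'w') as out:
        out.write('\n'.join(subset) + '\n')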

The yolov3 annotation format is shown below: each line is <class_id> <x_center> <y_center> <width> <height>, with the coordinates normalized by the image width and height.

9 0.732955 0.591102 0.270317 0.193503

To count the labels per class you can use:

awk '{print $1}' *.txt | sort -g | uniq -c

That is the whole workflow of building your own dataset: crawl images -> organize images -> annotate images -> train.