Adding white line between text lines

Ahmed Bilal Published on 2018-01-12 21:34:41Z

Hi, I am trying to do OCR using Tesseract, and the overall results seem acceptable. The images are very long receipts that we scan with a scanner, so the quality is good. The only issue is that in the receipts a few characters are joined between two lines.

Please see the attached sample image. In the first line the character 'p' and in the second line the character 'M' are joined, which causes problems for the OCR. So the real question is: can we add a white line or square between every text line?

fmw42 Replied on 2018-01-13 04:04:10Z

You can do that for this image in Imagemagick. Trim the image to remove the surrounding white and add the same amount of black in its place. Then average that image down to one column and look for the brightest row. I start and stop 4 pixels from the top and bottom to avoid any really bright rows in those regions. Once I find the brightest row, I splice in 4 rows of white between the top and bottom regions divided by that row. This is not the most elegant way, but it shows the potential. One could likely pipe the list of row values to AWK and search for the max value more efficiently than by saving them to an array and using a for loop. Unix syntax with Imagemagick:


# Collapse the image to a single column and collect one gray value per row.
arr=(`convert text.png -fuzz 50% -trim -background black -flatten -colorspace gray -scale 1x! -depth 8 txt:- | tail -n +2 | sed -n 's/^.*gray[(]\(.*\)[)]$/\1/p'`)
#echo "${arr[*]}"
num=${#arr[*]}
max=0
row=0
# Find the brightest row, skipping 4 rows at the top and bottom.
for ((i=4; i<num-4; i++)); do
val=${arr[$i]}
max=`convert xc: -format "%[fx:$val>$max?$val:$max]" info:`
row=`convert xc: -format "%[fx:$val==$max?$i:$row]" info:`
#echo "$i $val $max $row"
done
convert text.png -gravity north -splice 0x4+0+$row text2.png

If you want less space, you can change this to -splice 0x1+0+$row, but it won't change much. This does not write over your image; it inserts white rows between the existing rows.
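The AWK variant mentioned above might look roughly like the sketch below. It replaces the bash array and the per-row `convert xc:` calls with a single pass over the row values. The function name `brightest_row` and its margin argument are made up for illustration and are not part of Imagemagick. On ties it keeps the last of the equally bright rows, as the for loop does.

```shell
# brightest_row: read one gray value per line on stdin (row 0 first) and
# print the 0-based index of the brightest row, ignoring `margin` rows at
# the top and bottom. Hypothetical helper, not an Imagemagick command.
brightest_row() {
  awk -v margin="${1:-4}" '
    { val[NR - 1] = $1 + 0 }      # store values, 0-based like the bash array
    END {
      m = margin + 0
      row = -1
      for (i = m; i < NR - m; i++)
        if (row < 0 || val[i] >= max) { max = val[i]; row = i }
      print row                   # prints -1 if there are too few rows
    }'
}

# Assumed usage, feeding it the same pipeline as the array version:
# row=$(convert text.png -fuzz 50% -trim -background black -flatten \
#         -colorspace gray -scale 1x! -depth 8 txt:- | tail -n +2 |
#         sed -n 's/^.*gray[(]\(.*\)[)]$/\1/p' | brightest_row 4)
# convert text.png -gravity north -splice 0x4+0+$row text2.png
```

Because the comparison happens inside one awk process, this avoids starting two `convert` processes per image row, which matters for very long receipts.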

Even after this processing, your OCR still may not recognize the p or M, since the bottom of the p is cut off and attached to the M.

If you have more than two lines of text, you will have to search the column for approximately evenly spaced maxima.
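One possible way to search for several maxima is sketched below. It treats a row as a gap candidate when it is the brightest row within a window of neighbours on either side. The helper name `gap_rows` and the window argument are assumptions for illustration; the window would need to be tuned to roughly half the text line height, and the approach has not been tried on real receipts.

```shell
# gap_rows: read one gray value per line on stdin (row 0 first) and print
# the 0-based index of every row that is the brightest within `w` rows on
# either side. Hypothetical helper, not an Imagemagick command.
gap_rows() {
  awk -v window="${1:-10}" '
    { val[NR - 1] = $1 + 0 }
    END {
      w = window + 0
      last = -w - 1
      for (i = w; i < NR - w; i++) {
        peak = 1
        lo = val[i]
        for (j = i - w; j <= i + w; j++) {
          if (val[j] > val[i]) peak = 0
          if (val[j] < lo) lo = val[j]
        }
        # require a real bump (not a flat stretch) and skip rows that sit
        # within w rows of a peak already reported
        if (peak && val[i] > lo && i - last > w) { print i; last = i }
      }
    }'
}

# Assumed usage: splice from the bottom row upward so that rows inserted
# lower down do not shift the positions still to be processed.
# cp text.png text2.png
# for r in $(convert text.png -fuzz 50% -trim -background black -flatten \
#              -colorspace gray -scale 1x! -depth 8 txt:- | tail -n +2 |
#              sed -n 's/^.*gray[(]\(.*\)[)]$/\1/p' | gap_rows 10 | sort -rn); do
#   convert text2.png -gravity north -splice 0x4+0+$r text2.png
# done
```

Rows within the window of the top and bottom edges are never reported, which plays the same role as the 4 pixel margin in the two-line case.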

