Requirements
转换过程中需要用到PyGlossary,直接去github克隆仓库,或者直接去releases页面下载最新版。作者在这里已经给出了Requirements. 我电脑上配置了anaconda环境,通过brew
安装好gtk+3和pygobject3之后,通过Gtk3-based interface运行程序一直报错。而且Tkinter-based interface也是报错,懒得折腾虚拟环境了。所以索性就直接用命令行运行了。
以下所有操作都是基于macOs Big Sur 11.1进行的
如果需要用过Command-line interface运行的话,可以通过执行以下命令。
pip3 install prompt_toolkit python3 main.py –ui=cmd
我是直接通过命令行逐步运行的。
下面的链接是官方给出的例子。
https://github.com/ilius/pyglossary/blob/master/doc/apple.md
除了这几个包,还需要下载Command Line Tools for Xcode,直接登录自己的apple id就可以下载。
找到Additional Tools for Xcode,下载解压。Dictionary Development Kit是Additional Tools for Xcode的一部分。在解压好的Additional Tools for Xcode文件夹里面找到Dictionary Development Kit,将它放到~/Developer/Extras/文件夹下。
这一步只是为了省事,避免在接下来的步骤中需要手动调整Makefile里面的路径。
Convert Mdict to Mac OS X dictionary
建议大家把一个词典和它相对应的所有资源文件全部放到一个文件夹下。
假设我们已经有了一个词典文件,在~/Downloads/oald8/oald8.mdx
里面。跟mdx
在同一个文件夹里面的还有mdd
资源文件。
假设我们把最新版的pyglossary解压在了这个文件夹~/Software/pyglossary
1
2
3
|
cd ~/Downloads/oald8/
python3 ~/Software/pyglossary/main.py --write-format=AppleDict oald8.mdx oald8-apple
cd oald8-apple
|
我是将pyglossary文件夹和词典文件夹放在同一个目录下。所以我的命令是这样的:
1
2
|
python3 ./main.py --write-format=AppleDict ../oald8/oald8.mdx ../oald8-apple
cd ../oald8-apple
|
这个文件夹里面oald8.xml
放的是词典文件,其他资源文件都在OtherResources
里面。
需要把所有oald8.xml
里面相对链接里面的"\“去掉。
sed -i "” ’s:src="/:src=":g' oald8.xml
然后还有一个问题是音频文件,原始的音频文件里面有不少spx格式,原作者说的是需要用speex转换,实际上并不需要。直接改一下后缀名就可以了。修改后缀名可以进入到文件夹里面,全选右键,重命名。然后选择后缀即可。改成wav或者mp3都行。我改的是mp3。
1
2
|
find OtherResources -name "*.spx" -execdir sh -c 'spx={};speexdec $spx ${spx%.*}.wav' \;
sed -i "" 's|sound://\([/_a-zA-Z0-9]*\).spx|\1.wav|g' oald8.xml
|
spx文件一般都在OtherResources/media/spx
路径下。接下里就是把词典文件oald8.xml
里面所有的引用全部改掉。
打脸了,直接改后缀在词典里面是播放不出来的,还是需要使用speex。安装speex的过程中需要从源码编译安装,使用port install speex命令是不行的。很多人说还需要将wav转换成mp3,实际上是不需要的。亲测。当然如果需要优化存储的话还是需要使用mp3。wav比spx甚至大了几十倍。。。下面是我转换Cambridge English Pronouncing Dictionary - 18th Edition得到的一张对比图:
sed -i "" ’s|sound://([/_a-zA-Z0-9]*).spx|\1.wav|g' oald8.xml
以上两步实际上都是需要自己查看xml里面的代码,看看是以那种形式出现的。并不一定都跟上面相同。我在最后会给出我转换不同词典的过程中用到的正则表达式。
最后一步就是编译,安装。
make过程中可能会出现一些问题:
~/Developer/Extras/Dictionary Development Kit/bin/build_dict.sh: line 221: /bin/cp: Argument list too long
解决方案是将build_dict.sh
的第221行改为下面的代码,重新执行make。
rsync -a “$OTHER_RSRC_DIR”/ “$OBJECTS_DIR”/dict.dictionary/"$CONTENTS_DATA_PATH" || error “Error.”
完成之后,会在当前文件夹下生成一个objects文件夹,里面就有我们所需要的.dictionary文件。
原生词典里面是不支持base64
格式的字体文件的,所以我们需要手动将原先css里面base64格式的字体转换成它原来的样子并放到Contents
文件夹下。
下面这个网站可以提供在线转换服务
Base64 Online - base64 decode and encode
转换过程如下
默认css的路径如下所示:
objects/oald8-apple.dictionary/Contents/DefaultStyle.css
打开DefaultStyle.css
, 在里面查找@font-face
,找到里面的将对应字体的base64编码复制到上面那个网站里面,选择decode。输出栏目里面选择export to a binary,其实后面的文件名可以直接以ttf对应的字体格式结尾。例如:ttf,woff之类的。前缀就写成原先的字体名。下面代码中font-family就是对应的字体名。
下面我举一个修改之前和修改之后的例子吧。
1
2
3
4
5
6
7
8
9
10
11
12
13
|
@font-face {
font-family: 'lm5pp_icomoon';
src: url(data:font/truetype;charset=utf-8;base64,AAEAAAAOAIAAAwBgRkZUTX/w9zYAAAakAAAAHEdERUYAJwANAAAGhAAAAB5PUy8yDxcGTwAAAWgAAABgY21hcOpwLgYAAAHkAAABZmdhc3AAAAAQAAAGfAAAAAhnbHlmVsOtfQAAA1wAAAE0aGVhZA5ZovkAAADsAAAANmhoZWEHMgPIAAABJAAAACRobXR4DUkAAAAAAcgAAAAcbG9jYQA4ALoAAANMAAAAEG1heHAACwBWAAABSAAAACBuYW1l+lhN2AAABJAAAAGbcG9zdIDVhT8AAAYsAAAAT3dlYmaPmVnAAAAGwAAAAAYAAQAAAAAAADy/p+JfDzz1AAsEAAAAAADV5h5jAAAAANXmQBgAAAAAA3ADgQAAAAgAAgAAAAAAAAABAAADwP/AAAAEAAAAAAADcAABAAAAAAAAAAAAAAAAAAAABwABAAAABwBUAAMAAAAAAAIAAAAAAAAAAAAAAAAAAAAAAAMC/QGQAAUABAKZAswAAACPApkCzAAAAesAMwEJAAAAAAAAAAAAAAAAAAAAARAAAAAAAAAAAAAAAAAAAAAAQAAB6icDwP/AAEADwABAAAAAAQAAAAAAAAAAAAAAIAABBAAAAAAAAAABVQAAAAAAAAIAAAAB9AAABAAAAAAAAAMAAAADAAAAHAABAAAAAABgAAMAAQAAABwABABEAAAADAAIAAIABAABACAl/Oon//3//wAAAAAAICX86if//f//AAD/5NoJFd8AAwABAAwAAAAAAAAAAAAAAAEAAwAAAQYAAAEDAAAAAAAAAQIAAAACAAAAAAAAAAAAAAAAAAAAAQAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAIABAAGAAgAJoAAQAAAAAAAAAAAAIAADkCAAEAAAAAAAAAAAACAAA5AgABAAAAAAAAAAAAAgAAOQIAAQAAAAAAAAAAAAMAADkDAAMAAAAAA3ADgQAfADgAUwAAJSImJyY0Nz4BNCYnJjQ3NjIXHgMVFA4CBw4BIzEnIiYnJjQ3PgE0JicmNDc2MhceARQGBw4BByImLwEjIiY1ETQ2OwE3PgEXHgEVERQGBw4BAtAKEQcODjExMTEODg4nDh8vIBERIC8fBxEJqwkSBw4OHh8fHg4ODigOLC0tLAcSjgYMBfZzDRMTDXP2BxMJCQsLCQMGgAcIDicOMnuCezIOJw4PDx5HTVQrK1RNRx4IB1sHBw4oDh5NUE0eDigODg4scXRxLAcH2wUE9xMNAUANE/cGBAMEEAr8wAoQBAEBAAAAAA4ArgABAAAAAAABAAcAEAABAAAAAAACAAcAKAABAAAAAAADAAcAQAABAAAAAAAEAAcAWAABAAAAAAAFAAsAeAABAAAAAAAGAAcAlAABAAAAAAAKABoA0gADAAEECQABAA4AAAADAAEECQACAA4AGAADAAEECQADAA4AMAADAAEECQAEAA4ASAADAAEECQAFABYAYAADAAEECQAGAA4AhAADAAEECQAKADQAnABpAGMAbwBtAG8AbwBuAABpY29tb29uAABSAGUAZwB1AGwAYQByAABSZWd1bGFyAABpAGMAbwBtAG8AbwBuAABpY29tb29uAABpAGMAbwBtAG8AbwBuAABpY29tb29uAABWAGUAcgBzAGkAbwBuACAAMQAuADAAAFZlcnNpb24gMS4wAABpAGMAbwBtAG8AbwBuAABpY29tb29uAABGAG8AbgB0ACAAZwBlAG4AZQByAGEAdABlAGQAIABiAHkAIABJAGMAbwBNAG8AbwBuAC4AAEZvbnQgZ2VuZXJhdGVkIGJ5IEljb01vb24uAAAAAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHAAABAgACAQMAAwEEAQUGZ2x5cGgxB3VuaTAwMDEHdW5pMjVGQwd1bmlFQTI3AAABAAH//wAPAAEAAAAMAAAAFgAAAAIAAQABAAYAAQAEAAAAAgAAAAAAAAABAAAAANQkmLoAAAAA1eYeYwAAAADV5kAYAAFZwI+YAAA=) format('truetype');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'cjkextent';
src: url(data:font/truetype;charset=utf-8;base64,AAEAAAAOAIAAAwBgRkZUTXxzNGMAAAygAAAAHEdERUYAJwAOAAAMgAAAAB5PUy8yVxjhrwAAAWgAAABgY21hcEymT7AAAAHkAAABWmdhc3AAAAAQAAAMeAAAAAhnbHlmDIR02gAAA1QAAAZYaGVhZAhq7tgAAADsAAAANmhoZWEB1gHWAAABJAAAACRobXR4BFsALQAAAcgAAAAabG9jYQTgA1wAAANAAAAAEm1heHAAEwC3AAABSAAAACBuYW1lSV8NJAAACawAAAJvcG9zdDY8ImAAAAwcAAAAWndlYmbBa1pNAAAMvAAAAAYAAQAAAAYUfDZcJ6tfDzz1AAsBAAAAAADR2ym+AAAAANZzceoAAP/jAPgA1AAAAAgAAgAAAAAAAAABAAAA3P/cAAAB9AAAAAAA+AABAAAAAAAAAAAAAAAAAAAABQABAAAACAC1AAoAAAAAAAIAAAAAAAAAAAAAAAAAAAAAAAQBMAGQAAUABACAAIAAAAAQAIAAgAAAAIAADABBAAACAQYJBgEBAQEBgAACgxAAAAAAAAAWAAAAACAgICAAACX8JkAA3P/cACQA3AAkQAAAAcDWAAAAdQCwAAAAIAABAQAAHAAAAAAAVQAAAfQAAAEAAAgACgAJAAgAAAAAAAMAAAADAAAAHAABAAAAAABUAAMAAQAAABwABAA4AAAACgAIAAIAAiX8JgYmCSZA//8AACX8JgUmCSZA///aB9n/2f3ZxwABAAAAAAAAAAAAAAAAAQYAAAEAAAAAAAAAAQIAAAACAAAAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABIAEgASABoApgGIApIDLAAAAAIAHP/zAOAAwQADAAcAADcjFTMnIzUz4MTECLW1wc4HwAAAAAEAAAAAAAAAAAADAAA5AwAFAAj/5wD2ANIADwAzAFUAWQBdAAA3MzcXBxUXBzUjFQc2NTQnBjcXBwYHHgEVFAYHJic3Fj4BJwYHJzY3JicGByc2NyYnNxYfARYXNSMiByczNxcjFTM3FyMVFjcVBgcuAScGByc+ATcXBzc1IxU3NSMVgkoIDgcBDkwOAQEnCxEJDgwNCBUVAhwBGxoHBBsnAioWAwcVGgIXFAwYAhcSPQ8WJg0JCnwLEkYgChI8IScOBCsuERAZAxQUAhQISkxMTMULDQcwGAgODgQfFRUdFBgQAxEPGTAYKSUGDgkFBws2IygYBCYrDw8VDQQQGRMSBAwUihMJUQMKCxIiCxItCAMFAgwEFhkiEQQUMRkOBUscHCMaGgAABQAK/+cA8QDTAEoAYACOAJQAnQAAFz4BNTQnFzM1BgcUBgcnPgEnFzY3FwYHFwcVMzcXIxUzNxcOAQcnNyMVNzUXBxU/ARcHFRQ3MzI2NTMUFhcOASsBIj0BDwEnFgYHPwEXBgcnNj0BIw4BByc+AScXMzcXBzceATY1IxUzNxcHDgIHJic1FjMyPgE3IwcnNzQnFzM2NxcGBzM3FwcOAQcmJw8BJzM3FycWDgEjIicmJxoXAwENEhEYBhYDDAgCDjsQDA8bDAYOCQ0kFggNCQcIAgc2DxEGCwQNHAcNBAMEAwUECQUUDAYDBgEQFFIQAhAHCQMPAQwQAhAEAQ0OBwsHQQ0PAzVACg4JAwUNDQQcFQsHCAQCQQgNCAEQCAcCEQgOJAkOCAEEEAMUIQgINA0SKRQCBQIDAQIHESM5FxcbBysDAj1jJgEeWFoHCwwPAQUJBAgIDRIHDQIHCQESJQQbBwUMAwgJBxAHAQwKCwoBBwILEQIDAy4xFCIOAhQNCwUGIhsbDAIRKxAHBgsERgICGyJeCg4FLx0OBRAGBAQHJSEKCwddCwgVDAoEEwkNBiUeBg0FTgMIDRJ/Dg4ECAsLAAAKAAn/5QD2ANQALQB1AHkAfwCDAJAAmAChAKsAtAAANzM2NRcPATM3FwcVFwc1IxUzNxcjFTM3FwcOAgcmJzUWMzI2NyMVBzY9ATQnBgcnNjcnBgcnNjcnNxc2NwYHJzY3NSMiByczNDUXBxUzNxcjFTY3Fw8BHgEGJicGBxYXFgYnJicGBxcPATM3FwcGBx4BBiYnNxUzNQYHFzY3IzcVMzUnDwEeAQYmJwYHJzY3FxYOASY+AT8BHgEOATU0JicHHgEGIyI1NCYnNx4BBiMiLgEnlgkGFAgKJAgOBwEON0cLDmBKCA0HAgQMCgEZEAgIBQNKDgEBVyUBJhIbCgkDCQcGAwYMBhUYASYRGAsJCTUUBxsLEDYOBw8FBg0KCAgJBg4VCw0ICgoXCQcNBQgkCA4IDQoLDggQCko3nAQeCgopYDeZBwULBwkFBggLAwwGawIEDwcPBAE5DQoGCQgDJgsIBgUEBAIWEQQFBAMBBgSzGAkIBRQKDAYkEQYIEwsSFAkNBC8aEAQMBQQBDjwFBRUOQg0PyAwEFBUaCgcDCgoGAwQSEhMIBBcgLAIJEBQJBRYLEjAUGggECwYLDhEKCgwHCQoWDw4PDAcJAw0IDQUcCggOFBcJoRQUdAYWDBZTEhIKBAgHCg0SBw8MAhYdcxcTBwkRDgkJCw8PAQcIFAUFDBMNCgoPBwYPEgoQEQgAAAAABQAI/+MA+ADUACUAPwBVAGQAagAAFhcHNj0BIxUHNjU0JxczNCcXBxUzNxcHFRQGBy4BJzUWNj0BIxUnBhUnNjc1IwcnMzUjByczNxcjFTM3FyMVNzYGJjY3MxUzNjcXBgczNxcOAQcnNyM2FgczNxcjIgcnMy4BJzcHHgEGJietAQ8BIQ4BARAfARUHIgcPCAcKAwgLEggjSFEMGQ4OCwojEQsKPgoSJwgIECAoDw0OFAEFQgUIEwsPJgoOCAwIAwlwOBANKQoSbA0JCkEBCgoDExgEDQYMDgcIEBFBPQgRFRYYCA8SCgYRCQwGLgcMBAgHAwQCAQUtSzEhBxEGBkIDCjsDCgsSOwkQPw0yCggREQoMIAwEHAoQAwoLAhVYEAwLEgMKCREKAzINDw8bDQAAAAAAABAAxgABAAAAAAAAACgAUgABAAAAAAABAAoAkQABAAAAAAACAAYAqgABAAAAAAADABcA4QABAAAAAAAEAAoBDwABAAAAAAAFABwBVAABAAAAAAAGAAoBhwABAAAAAAAKAAYBogADAAEECQAAAFAAAAADAAEECQABABQAewADAAEECQACAAwAnAADAAEECQADAC4AsQADAAEECQAEABQA+QADAAEECQAFADgBGgADAAEECQAGABQBcQADAAEECQAKAA4BkgBDAHIAZQBhAHQAZQBkACAAYgB5ACAAKABoAHQAdABwADoALwAvAHcAdwB3AC4AZwB1AG8AeAB1AGUAZABhAHMAaABpAC4AbgBlAHQALwApAABDcmVhdGVkIGJ5IChodHRwOi8vd3d3Lmd1b3h1ZWRhc2hpLm5ldC8pAABLAGEAaQBYAGkAbgBTAG8AbgBnAABLYWlYaW5Tb25nAABOAG8AcgBtAGEAbAAATm9ybWFsAABLAGEAaQBYAGkAbgBTAG8AbgBnADoAVgBlAHIAcwBpAG8AbgAgADYALgAwADgAAEthaVhpblNvbmc6VmVyc2lvbiA2LjA4AABLAGEAaQBYAGkAbgBTAG8AbgBnAABLYWlYaW5Tb25nAABWAGUAcgBzAGkAbwBuACAANgAuADAAOAAgAEoAYQBuAHUAYQByAHkAIAA0ACwAIAAyADAAMQA4AABWZXJzaW9uIDYuMDggSmFudWFyeSA0LCAyMDE4AABLAGEAaQBYAGkAbgBTAG8AbgBnAABLYWlYaW5Tb25nAABfAF8AwwBbAIsATwBTAABfX8xbT1MAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAABAAIBAgEDAQQBBQEGB3VuaTI1RkMHdW5pMjYwNQd1bmkyNjA2B3VuaTI2MDkHdW5pMjY0MAAAAAEAAf//AA8AAQAAAAwAAAAWAAAAAgABAAEABwABAAQAAAACAAAAAAAAAAEAAAAA1CSYugAAAADR2ym+AAAAANZzceoAAVpNwWoAAA==) format('truetype');
font-weight: normal;
font-style: normal;
}
|
记得修改的时候,字符编码和源中保持一致,我现在展示的这个源是uft-8,所以我在下图中选择utf-8。
修改之后的代码如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
|
@font-face {
font-family: 'lm5pp_icomoon';
src: url("lm5pp_icomoon.ttf");
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'cjkextent';
src: url("cjkextent.ttf");
font-weight: normal;
font-style: normal;
}
|
把转换之后的字体文件也需要放到Contents
文件夹里面。
还有一个问题是原生词典不支持加载js
脚本的,所以很多时候都需要我们自己手动调整css
文件,让它全部展开。
经过上面的步骤,我们就可以得到了自己DIY之后的css,需要将自己修改好的css文件直接替换掉原先路径下的DefaultStyle.css
。使用别人提供的css也一样,也需要把字体从base64
格式转换成原有的格式。然后替换掉默认的css文件。
最后就是安装了,直接运行下面的命令就会将.dictionary
文件夹移动到~/Library/Dictionaries/
。
Example
LDOCE5
以下是我转换LDOCE5++时候用到的命令。里面有一些正则表达式可以参考。然后我自己基于lgmcw大佬和trivialstuff的css弄出了一个比较适合mac用的风格。修复上一版中原先的spx音频播放不了的问题。
1
2
3
4
5
|
python3 ./main.py --write-format=AppleDict ../LDOCE5/LDOCE5++\ V\ 2-15.mdx ../朗文当代5
cd ../朗文当代5
find OtherResources -name "*.spx" -execdir sh -c 'spx={};speexdec $spx ${spx%.*}.wav' \;
rm -rf OtherResources/*.spx
sed -i "" 's|\([/_a-zA-Z0-9]*\).spx|\1.wav|g' 朗文当代5.xml
|
lgmcw大和trivialstuff的css放入mac中都有一个问题,就是PHRASAL VERBS
默认不展开,我注意到应该是mdx导致的原因。所以我直接在css里面强制将这些项目全部展开了。以下是效果图。corpus默认不展开,需要的话可以自行在css里面调整。(直接搜索hide,里面会有提示的)
Cambridge English Pronouncing Dictionary - 18th Edition
为了节省电脑空间,还是需要把wav压缩一下的。
1
2
3
4
5
6
7
8
9
10
|
python3 ./main.py --write-format=AppleDict ../Cambridge\ English\ Pronouncing\ Dictionary\ -\ 18th\ Edition/Cambridge\ English\ Pronouncing\ Dictionary\ -\ 18th\ Edition.mdx ../Cambridge\ English\ Pronouncing\ Dictionary\ -\ 18th\ Edition-apple
cd ../Cambridge\ English\ Pronouncing\ Dictionary\ -\ 18th\ Edition-apple
find OtherResources -name "*.spx" -execdir sh -c 'spx={};speexdec $spx ${spx%.*}.wav' \;
# rm -f OtherResources/*.spx 可能报如下错误
# zsh: argument list too long: rm, 可以改成下面的命令
find OtherResources/ -name '*.spx'|xargs rm -f
cd OtherResources
sh ./converter.sh
cd ..
sed -i "" 's|\([/_a-zA-Z0-9]*\).spx|\1.mp3|g' Cambridge\ English\ Pronouncing\ Dictionary\ -\ 18th\ Edition-apple.xml
|
转换代码, converter.sh。将这个文件放入OtherResources里面, 直接运行即可。
1
2
3
4
5
6
|
for i in *.wav;
do
ffmpeg -i "$i" -f mp3 "${i}.mp3";
rename 's/.wav.mp3/.mp3/g' "${i}.mp3"
rm -f "$i" # 如果需要保留原先的wav文件,则注释掉这一行
done
|
转换之前我看了一下大概有15G。下图是转换过程中的一张图。原始的spx文件只有400m左右。请忽略文件数量的差异,因为我在转换过程中添加了一些脚本文件。左图显示的是正在转换过程中的截图,有脚本新增的也有还没来得及清理的文件。
wav
spx
mp3
Summary
本文只是做了一个很简单的介绍,如何将第三方词库(Mdic)转换为mac原生词库。时间允许的话我之后会做一个关于修改词典里面css的视频教程。敬请期待!