Posted on 2017-6-8 18:15:13
Last edited by feixiang20 on 2017-6-8 18:17
Special usage: "Load balance" between master and slave
It is like two people eating from one plate of beans: each keeps picking until the plate is empty.
Master and slave perform the same repetitive operation many times, for example:
- Processing all pixels in an image
- Multiplying many pairs of matrices
The master initializes the input data and sets up an "items remaining" count-down counter that holds the number of unprocessed operations.
The master then sends a message to the slave telling it to start sharing the processing load.
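A minimal sketch of that setup in C, assuming each operation is one multiplication of a small matrix pair. The 2x2 size, the pair count and all names (`lhs`, `rhs`, `out`, `master_init`, `process_item`) are hypothetical, and the message to the slave is left out because the post does not say which mechanism carries it. The point is that every operation is independent, so either core may do any of them, and only the counter and busy flag are shared control data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N 2            /* hypothetical: 2x2 matrices keep the example small */
#define NUM_PAIRS 8u   /* hypothetical number of matrix pairs */

typedef struct { int32_t m[N][N]; } mat_t;

/* Shared work list: operation i computes out[i] = lhs[i] * rhs[i]. */
static mat_t lhs[NUM_PAIRS], rhs[NUM_PAIRS], out[NUM_PAIRS];

/* Shared control data; on the dual-core part it must sit in RAM both cores see. */
static volatile uint32_t items_remaining;  /* "items remaining" count-down counter */
static volatile bool     slave_busy;       /* set by the slave while it is processing */

/* One operation: independent of all others, so either core may run it. */
static void process_item(uint32_t i)
{
    assert(i < NUM_PAIRS);
    for (int r = 0; r < N; r++)
        for (int k = 0; k < N; k++) {
            int32_t sum = 0;
            for (int j = 0; j < N; j++)
                sum += lhs[i].m[r][j] * rhs[i].m[j][k];
            out[i].m[r][k] = sum;
        }
}

/* Master: fill the inputs, then arm the counter before notifying the slave. */
static void master_init(void)
{
    for (uint32_t i = 0; i < NUM_PAIRS; i++)
        for (int r = 0; r < N; r++)
            for (int k = 0; k < N; k++) {
                lhs[i].m[r][k] = (int32_t)(i + r + k);  /* arbitrary test data */
                rhs[i].m[r][k] = (r == k) ? 1 : 0;      /* identity matrix */
            }
    slave_busy = false;
    items_remaining = NUM_PAIRS;  /* every operation is still unprocessed */
    /* The "start" message to the slave would be sent here (not shown). */
}
```

Because the right-hand matrices are identities in this test data, each output matrix should equal its left-hand input, which makes the sketch easy to check.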
Load balance implementation
Both cores enter a processing loop (a "while (1)" loop). In each pass, each core:
- Locks the remaining counter with the H/W MUTEX.
- If the counter is 0, unlocks it and exits the loop.
- Otherwise, subtracts a number "pick_count" (pick_count >= 1) from the counter and unlocks it.
- Picks "pick_count" unprocessed operations and processes them; the M4 picks from first to last, the M0+ from last to first.
Because the two cores take their shares from opposite ends and the counter never lets more than the total be taken, the two ranges never overlap, so only the counter has to be protected by the mutex.
The slave also sets a "busy" flag while it is in the processing loop and clears it after exiting.
The master, after it exits the processing loop, waits until the slave's busy flag is cleared.
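The loop above can be sketched in C as follows. This is a single-threaded simulation, not the author's code: `hw_mutex_lock`/`hw_mutex_unlock` are hypothetical stand-ins for the LPC5411x hardware mutex, the two cores are mimicked by interleaving their loop iterations, and clamping the pick to what is left near the end is an assumption the post does not spell out. Each core keeps its own private cursor; only the counter is shared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ITEMS 100u   /* hypothetical number of operations */

typedef enum { CORE_M4 = 1, CORE_M0P = 2 } core_id_t;

typedef struct {
    core_id_t id;
    uint32_t  pick_count;  /* items taken per lock, >= 1 */
    uint32_t  cursor;      /* M4: next index from the front; M0+: end of its untouched tail */
} core_state_t;

/* Hypothetical stand-ins for the H/W MUTEX. This simulation is single-threaded,
 * so a flag that catches double-locking is enough. */
static volatile bool hw_mutex_held = false;
static void hw_mutex_lock(void)   { assert(!hw_mutex_held); hw_mutex_held = true; }
static void hw_mutex_unlock(void) { hw_mutex_held = false; }

static volatile uint32_t items_remaining;  /* shared "items remaining" counter */
static volatile bool     slave_busy;       /* set by the slave while in its loop */
static uint8_t           owner[NUM_ITEMS]; /* 0 = unprocessed, else the core id */

/* Placeholder for the real work (e.g. blurring one pixel). */
static void process_item(core_id_t id, uint32_t index)
{
    assert(index < NUM_ITEMS && owner[index] == 0);  /* each item exactly once */
    owner[index] = (uint8_t)id;
}

/* One pass of the while (1) loop body; returns false when it is time to exit. */
static bool core_step(core_state_t *c)
{
    assert(c->pick_count >= 1);
    hw_mutex_lock();
    uint32_t left = items_remaining;
    if (left == 0) {
        hw_mutex_unlock();
        return false;
    }
    /* Assumption: clamp to what is left so the counter cannot underflow. */
    uint32_t n = (c->pick_count < left) ? c->pick_count : left;
    items_remaining = left - n;
    hw_mutex_unlock();

    /* Work happens outside the lock: the counter guarantees that the M4's
     * front range and the M0+'s back range never overlap. */
    for (uint32_t i = 0; i < n; i++) {
        if (c->id == CORE_M4)
            process_item(c->id, c->cursor++);
        else
            process_item(c->id, --c->cursor);
    }
    return true;
}

/* Interleave the two cores' loop passes to mimic them running side by side. */
static void run_both(uint32_t m4_pick, uint32_t m0_pick)
{
    core_state_t m4 = { CORE_M4, m4_pick, 0 };
    core_state_t m0 = { CORE_M0P, m0_pick, NUM_ITEMS };
    for (uint32_t i = 0; i < NUM_ITEMS; i++) owner[i] = 0;
    items_remaining = NUM_ITEMS;   /* master: arm the counter */

    slave_busy = true;             /* slave: entering its processing loop */
    bool m4_running = true, m0_running = true;
    while (m4_running || m0_running) {
        if (m4_running) m4_running = core_step(&m4);
        if (m0_running) {
            m0_running = core_step(&m0);
            if (!m0_running) slave_busy = false;  /* slave: left its loop */
        }
    }
    while (slave_busy) { /* master: wait for the slave to finish */ }
}
```

With `run_both(4, 1)` the faster "core" ends up with 80 of the 100 items and the slower one with 20, and every item is processed exactly once; on real hardware the split would instead follow each core's actual speed.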
Load balance case study: Gaussian blur for a 128x128 image
It shows how much time is saved when the M4 and the M0+ together process the pixels of one image, compared with using only the M4 or only the M0+ on the same image. The M4 and the M0+ each pick unprocessed pixels on their own (just like two people competing to eat from the same plate of beans).
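To make "one pixel" concrete as a work item, here is a sketch of blurring the pixel with a given linear index. The post does not give the kernel, so the 3x3 kernel (weights 1-2-1, sum 16), the edge replication and the names are assumptions. Reading from a separate input buffer and writing to a separate output buffer is what lets the two cores process pixels in any order without interfering.

```c
#include <assert.h>
#include <stdint.h>

#define W 128
#define H 128

static uint8_t src[H][W];  /* input image, read-only during the blur */
static uint8_t dst[H][W];  /* output image, each pixel written exactly once */

/* Hypothetical 3x3 Gaussian kernel; the weights sum to 16. */
static const uint8_t kernel[3][3] = { {1, 2, 1}, {2, 4, 2}, {1, 2, 1} };

static int clamp_int(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

/* Blur the pixel with linear index `item` (0 .. W*H-1), i.e. one work item. */
static void blur_pixel(uint32_t item)
{
    assert(item < (uint32_t)(W * H));
    int y = (int)(item / W), x = (int)(item % W);
    uint32_t acc = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int sy = clamp_int(y + dy, 0, H - 1);  /* replicate edge pixels */
            int sx = clamp_int(x + dx, 0, W - 1);
            acc += (uint32_t)kernel[dy + 1][dx + 1] * src[sy][sx];
        }
    dst[y][x] = (uint8_t)((acc + 8u) / 16u);  /* divide by 16, rounding to nearest */
}
```

In the load-balanced loop, the M4 would call `blur_pixel` for indices counting up from 0 and the M0+ for indices counting down from `W * H - 1`.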
In this demo, the equivalent processing power of the two cores together is about 154%-180% of a single M4, or about 230%-260% of a single M0+. (Optimization level: -O2 for both the M4 and the M0+ code.)
If the M4 code runs from the same SRAM block as the M0+ code, the two cores compete for that RAM block and both slow down to about 87.4% of their normal performance.
CPU at 12000kHz, Press the SW2 button to Start.
Running Gaussian blur with M4 only
154 ms elapsed!
Running Gaussian blur with both M4 and M0+
90 ms elapsed!
Running Gaussian blur with M0+ only
218 ms elapsed!
Test done
CPU at 48000kHz, Press the SW2 button to Start.
Running Gaussian blur with M4 only
40 ms elapsed!
Running Gaussian blur with both M4 and M0+
23 ms elapsed!
Running Gaussian blur with M0+ only
54 ms elapsed!
Test done
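The percentages above can be checked against the log with a little arithmetic. Relative power is the single-core time divided by the dual-core time. As a best case, if each core works at its own constant rate and the work is split perfectly, both finish at the same moment, so 1/t = 1/t_M4 + 1/t_M0+, i.e. t = t_M4 * t_M0+ / (t_M4 + t_M0+). The sketch below only uses the times from the log; the function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>

/* Elapsed times in ms, copied from the log above. */
typedef struct { unsigned cpu_khz; double m4_only, dual, m0_only; } run_t;

static const run_t runs[] = {
    { 12000, 154.0, 90.0, 218.0 },
    { 48000,  40.0, 23.0,  54.0 },
};

/* Equivalent processing power of the dual-core run relative to one core. */
static double relative_power(double t_single, double t_dual)
{
    assert(t_dual > 0.0);
    return t_single / t_dual;
}

/* Best case for perfect balancing: both cores finish at the same moment,
 * so 1/t = 1/t_m4 + 1/t_m0. */
static double ideal_dual_ms(double t_m4, double t_m0)
{
    assert(t_m4 > 0.0 && t_m0 > 0.0);
    return t_m4 * t_m0 / (t_m4 + t_m0);
}

static void print_report(void)
{
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++) {
        const run_t *r = &runs[i];
        printf("%u kHz: dual = %.0f%% of M4, %.0f%% of M0+, ideal %.1f ms vs measured %.0f ms\n",
               r->cpu_khz,
               100.0 * relative_power(r->m4_only, r->dual),
               100.0 * relative_power(r->m0_only, r->dual),
               ideal_dual_ms(r->m4_only, r->m0_only), r->dual);
    }
}
```

For these runs this gives about 171% of M4 and 242% of M0+ at 12 MHz, and 174% and 235% at 48 MHz. The ideal times come out at about 90.2 ms and 23.0 ms, essentially the measured 90 ms and 23 ms, which suggests that picking work dynamically keeps both cores busy until the image is finished.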