Many short segments

jamiek · May 21, 2019, 12:38am

It is well known (I assume) that if CAM generates a large number of short segments, the communication between the PC (or Pi or what have you) and the controller can become a bottleneck and cause stuttering and slowness. There have been enough cases where the same gcode file runs fine from an SD card.

This got me to wondering, what exactly is the bandwidth limit?

I’ve updated my test pattern generator so it can create a sequence of many short segments end-to-end. The results, running both from the PC (Pronterface) and from SD, are shown here:

A few observations:

The first line (from 0:04 to 0:55) took 51 seconds to draw 2500 segments (49 segments/sec), approximately 67000 bytes (~1314 kB/s).
The second line (from 0:56 to 1:20) took 24 seconds to draw 1136 segments (47 segments/sec), approximately 30000 bytes (~1250 kB/s).
The third line (from 1:21 to 1:38) took 17 seconds to draw 735 segments (43 segments/sec), approximately 20000 bytes (~1176 kB/s).

These are pretty similar and I think it can be presumed they are bandwidth limited, but 1200 kB/sec is about half the maximum capacity that would be expected from 250000 bps.

The 15th line is where Marlin appears to keep up with the little segments. It has segments of length 0.38 mm at a feed rate of 1000 mm/min, or about 44 segments per second, which is pretty close to the segment rate on the first three lines.
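The ~44 segments per second above is just the feed rate (converted to mm/s) divided by the segment length. A minimal C sketch of that arithmetic; the function name is made up for illustration:

```c
/* Segments per second the controller must process to hold a feed rate:
 * (feed in mm/min / 60) gives mm/s, divided by the segment length in mm. */
double segments_per_second(double feed_mm_per_min, double segment_mm)
{
    double feed_mm_per_s = feed_mm_per_min / 60.0;  /* mm/min -> mm/s */
    return feed_mm_per_s / segment_mm;
}
```

For example, segments_per_second(1000.0, 0.38) comes out at about 43.9, matching the ~44 figure, while 0.02 mm segments at the same feed would need about 833 segments per second.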

From the SD card, the first line (from 4:30 to 4:42) takes about 12 seconds to draw 2500 segments, or about 200 segments per second, a bit over 4x as fast.

So, what can I do with this information? I’m not sure yet… maybe it’s pointless but I still think it’s interesting.

jeffeb3 · May 21, 2019, 3:14am

Isn’t that closer to 20x slower than it should be?

I wonder which part is creating that bottleneck. It’s definitely not the bandwidth, but the SD card is 4x faster, so maybe it’s latency? Maybe the time it takes for the firmware to reply ‘ok’ over the serial port and for the computer to send more data is enough to slow it down? It could also just be that the host software (Pronterface here) is not running in real time, and OS delays are adding enough latency between commands.
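One way to put numbers on the latency idea is a simple ping-pong model: the host sends a line, waits for `ok`, then sends the next. This is a sketch under stated assumptions (8N1 framing at 10 bits per byte, and one fixed turnaround delay per line that lumps together parsing, the `ok` reply, USB scheduling and OS delays), not a description of how any particular host actually works; the function names are illustrative. Feeding in the first test line (67000 bytes over 2500 segments, about 26.8 bytes per line, at 49 segments/s) suggests roughly 19 ms of dead time per line, against about 1 ms of actual transmission.

```c
/* Rough ping-pong model: each line is transmitted, then the host waits a
 * fixed per-line latency (assumed) before sending the next one.
 * Assumes 8N1 framing: start bit + 8 data bits + stop bit = 10 bits/byte. */
#define BITS_PER_BYTE_8N1 10.0

/* Time to clock one line of `line_bytes` characters over the UART. */
double line_tx_seconds(double line_bytes, double baud)
{
    return line_bytes * BITS_PER_BYTE_8N1 / baud;
}

/* Lines per second if every line also pays `latency_s` of dead time. */
double ping_pong_lines_per_second(double line_bytes, double baud,
                                  double latency_s)
{
    return 1.0 / (line_tx_seconds(line_bytes, baud) + latency_s);
}

/* Invert the model: per-line latency implied by an observed line rate. */
double implied_latency_seconds(double observed_lines_per_s,
                               double line_bytes, double baud)
{
    return 1.0 / observed_lines_per_s - line_tx_seconds(line_bytes, baud);
}
```

With zero latency the same model allows roughly 930 lines per second at 250000 baud, so if the model is anywhere near right, almost all of the per-line time is spent waiting rather than transmitting.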

I wonder how well it would work sending from another Arduino. I also wonder whether longer commands, like repeating the Y value or using larger numbers, affect this. My guess is they won’t, because it’s the lines per second that matter, not the line length. It would also be interesting, for lasers, to add a power setting command on every other line. Would that halve the top speed?

This is a good test for your setup; if you can find a way to improve this, then you have a good test to confirm it. What is the display showing you? The planner buffer size and raw data?

The numbers 50 (or 44) and 200 seg/s are also useful. At 10 mm/s, the smallest segments should be bigger than 0.2 mm for USB (10 / 50) and bigger than 0.05 mm for SD (10 / 200). We need more data from other machines, but that’s in the ballpark of what I’ve seen before.
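The rule of thumb above is just minimum length = speed / segment rate. A sketch assuming the rough ~50 seg/s (USB) and ~200 seg/s (SD) rates from this thread, which will differ on other machines; both helpers are hypothetical, and the second one shows the speed you would actually get when segments are shorter than that minimum:

```c
/* Shortest segment that still keeps up at `speed_mm_per_s`, given a
 * measured segment rate (segments per second). */
double min_segment_mm(double speed_mm_per_s, double rate_seg_per_s)
{
    return speed_mm_per_s / rate_seg_per_s;
}

/* Speed actually reachable with `segment_mm` segments: the requested
 * speed, or the segment-rate limit (length * rate), whichever is lower. */
double capped_speed_mm_per_s(double requested_mm_per_s, double segment_mm,
                             double rate_seg_per_s)
{
    double limit = segment_mm * rate_seg_per_s;
    return requested_mm_per_s < limit ? requested_mm_per_s : limit;
}
```

For example, 0.02 mm segments at 50 seg/s cap the motion at about 1 mm/s, whatever feed rate the gcode requests.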

vicious1 · May 21, 2019, 7:42am

Could it just be the USB chip on the RAMPS? I remember, a very long time ago, transferring a tiny file to the then-external memory cards, and it took forever, like a few hours. I know I researched it then, but I cannot remember whether it was hardware or firmware.

vicious1 · May 21, 2019, 7:42am

That display is more useful than I thought. I have one in a box, I might need to hook it up to one of the printers, super cool.

jamiek · May 21, 2019, 7:40pm

Oh jeez, my math was way off. It was 1200 bytes per second, not kB, and yes, 250,000 bits/s should be about 25,000 bytes per second, so it’s a factor of 20.
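The factor of 20 comes from asynchronous serial framing: with the usual 8N1 setting, each byte costs a start bit, 8 data bits and a stop bit, so 10 bits on the wire. A short sketch of that arithmetic (the function names are illustrative):

```c
/* Usable payload rate of an async serial link with 8N1 framing:
 * 1 start bit + 8 data bits + 1 stop bit = 10 bits per byte. */
double serial_bytes_per_second(double baud)
{
    return baud / 10.0;
}

/* Fraction of the link capacity that a measured byte rate actually uses. */
double link_utilization(double measured_bytes_per_s, double baud)
{
    return measured_bytes_per_s / serial_bytes_per_second(baud);
}
```

With the first line’s numbers, link_utilization(67000.0 / 51.0, 250000.0) comes out at about 0.053, roughly 1/19 of the link.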

Also, my test pattern generator has an “efficient” mode that doesn’t repeat Y, Z, or F. It makes a small difference, but nowhere close to the 2x difference in bandwidth.

As for why it’s so far from ideal, your guess is as good as mine…

guffy · May 22, 2019, 3:52am

You could try Repetier-Host (C#) or Repetier-Server (C++) as the host instead of Pronterface (Python),
but that probably won’t help much, because I guess the SD card runs over SPI, which is much faster than the COM port
(although at the same time, handling the SD card takes more CPU time because of parsing the FAT32 filesystem).

See HAL_SPI.h:


/**
* SPI speed where 0 <= index <= 6
*
* Approximate rates :
*
* 0 : 8 - 10 MHz
* 1 : 4 - 5 MHz
* 2 : 2 - 2.5 MHz
* 3 : 1 - 1.25 MHz
* 4 : 500 - 625 kHz
* 5 : 250 - 312 kHz
* 6 : 125 - 156 kHz
*
* On AVR, actual speed is F_CPU/2^(1 + index).
* On other platforms, speed should be in range given above where possible.
*/

#define SPI_FULL_SPEED 0 // Set SCK to max rate
#define SPI_HALF_SPEED 1 // Set SCK rate to half of max rate
#define SPI_QUARTER_SPEED 2 // Set SCK rate to quarter of max rate
#define SPI_EIGHTH_SPEED 3 // Set SCK rate to 1/8 of max rate
#define SPI_SIXTEENTH_SPEED 4 // Set SCK rate to 1/16 of max rate
#define SPI_SPEED_5 5 // Set SCK rate to 1/32 of max rate
#define SPI_SPEED_6 6 // Set SCK rate to 1/64 of max rate
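For a sense of scale, the AVR formula in that comment can be evaluated directly. This sketch assumes a 16 MHz ATmega2560 (the usual Mega/RAMPS board) and counts only raw SPI clock cycles, 8 per byte with no start or stop bits; real SD reads also pay for command overhead, waiting on the card and FAT32 bookkeeping, so actual throughput is lower. The macro and function names are hypothetical.

```c
/* Assumption: 16 MHz AVR (ATmega2560 on a Mega/RAMPS). */
#define ASSUMED_F_CPU_HZ 16000000UL

/* AVR SPI clock for a speed index 0..6, per the HAL_SPI.h comment:
 * F_CPU / 2^(1 + index), done here as a right shift. */
unsigned long avr_spi_clock_hz(unsigned int index)
{
    return ASSUMED_F_CPU_HZ >> (1 + index);
}

/* Raw SPI payload rate: 8 clock cycles per byte, no framing bits.
 * Ignores SD command overhead, card busy time and filesystem work. */
unsigned long spi_raw_bytes_per_second(unsigned int index)
{
    return avr_spi_clock_hz(index) / 8;
}
```

At index 0 that is 8 MHz, or at most 1,000,000 bytes/s on the bus, compared with 25,000 bytes/s for 250000 baud serial with 8N1 framing.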

jamiek · February 12, 2022, 6:44pm

@Paciente8159 had asked about diagonal segments and whether the calculation of the diagonal would put additional CPU load that would affect processing.

I did a test with both straight and diagonal lines and the results are here:

These are the parameters:

; mode: dense_segments
; rapid feedrate: 2000 mm/min
; raise/lower feedrate: 800 mm/min
; pen down z level: -0.5
; pen up z level: 0.5
; drawing feedrate: 1000 mm/min
; x extent: 50
; y extent: 20
; dense_minseg: 0.02
; dense_maxseg: 0.5
; dense_efficient: false

I timed the first five lines of each (to ~1 second accuracy), and these are the times I found:
Straight lines:
0:04 - 0:28 → 24s
0:30 - 0:45 → 15s
0:47 - 0:57 → 10s
0:59 - 1:07 → 8s
1:09 - 1:15 → 6s

Diagonal lines:
3:13 - 3:39 → 26s
3:40 - 3:56 → 16s
3:58 - 4:09 → 11s
4:10 - 4:18 → 8s
4:20 - 4:26 → 6s

This is with what I think are the same parameters as the previous test, but it’s sent from OctoPrint on my Raspberry Pi, not from Pronterface and not from an SD card. The previous result had shown a 4x difference between Pronterface and the SD card, and OctoPrint is in between, at about twice the speed of Pronterface and half the speed of the SD card (24 s for the first line, versus 51 s from Pronterface and 12 s from SD).
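Putting the three senders on the same footing: assuming the first line is drawn entirely with dense_minseg segments, it has 50 / 0.02 = 2500 segments, which matches the 2500 segments in the earlier test. A small sketch with illustrative names:

```c
/* Segment throughput for one timed line. */
double segment_rate(double segments, double seconds)
{
    return segments / seconds;
}

/* Hypothetical helper: segments in a line of `extent_mm` cut into equal
 * pieces of `seg_mm`, rounded to the nearest whole segment. */
long segment_count(double extent_mm, double seg_mm)
{
    return (long)(extent_mm / seg_mm + 0.5);
}
```

segment_rate(2500.0, 51.0), segment_rate(2500.0, 24.0) and segment_rate(2500.0, 12.0) give about 49, 104 and 208 segments per second for Pronterface, OctoPrint and the SD card respectively.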

With varying Y values, the gcode is about 20% larger. For the straight lines, each movement looks like this:
G1 X0.0200 Y0 Z-0.5 F1000
whereas with the diagonal lines, each movement looks like this:
G1 X0.0141 Y0.0141 Z-0.5 F1000
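The ~20% figure can be checked by counting characters. This sketch counts each line plus a trailing newline and ignores any line numbers and checksums the host may add, so the absolute byte counts are approximate; the helper names are made up.

```c
#include <string.h>

/* Bytes sent for one move: the text plus a trailing '\n'. */
size_t wire_bytes(const char *gcode_line)
{
    return strlen(gcode_line) + 1;
}

/* Relative size increase of line `b` over line `a` (0.19 = 19% larger). */
double size_increase(const char *a, const char *b)
{
    return (double)wire_bytes(b) / (double)wire_bytes(a) - 1.0;
}
```

The straight-line move above is 26 bytes and the diagonal one is 31, about 19% more. Later moves have larger coordinate values, so the ratio shifts slightly along each line.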

I can also try running diagonal lines in “efficient” mode, which will generate movements like this:
G1 X0.0283 Y0.0283
According to my previous comments, efficient mode made a small difference, so the extra text may account for the diagonal lines being slower.
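To make the two output styles concrete, here is a hypothetical sketch of how a generator could emit moves along a 45-degree diagonal. It is not the actual test pattern generator’s code: it assumes the line starts at the origin, takes Z (-0.5) and F (1000) from the parameters listed above, and in “efficient” mode sends Z and F only on the first move, since they are modal.

```c
#include <stdio.h>

/* Hypothetical: write move number `i` (1-based) of a 45-degree diagonal
 * made of `seg_mm` segments into `buf`. Returns the snprintf length. */
int format_diagonal_move(char *buf, size_t len, int i, double seg_mm,
                         int efficient)
{
    /* Each axis advances seg_mm / sqrt(2) per segment. */
    const double step = seg_mm * 0.70710678118654752;
    double xy = i * step;

    if (efficient && i > 1)  /* Z and F are still in effect from move 1 */
        return snprintf(buf, len, "G1 X%.4f Y%.4f", xy, xy);
    return snprintf(buf, len, "G1 X%.4f Y%.4f Z-0.5 F1000", xy, xy);
}
```

Move 1 in normal mode produces `G1 X0.0141 Y0.0141 Z-0.5 F1000` (30 characters), and move 2 in efficient mode produces `G1 X0.0283 Y0.0283` (18 characters).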

This is not to say that there is no impact at all from CPU load due to the calculations, but if the difference is smaller than handling the redundant text, then that’s worth knowing.

jamiek · February 12, 2022, 7:37pm

I tried two more variants: horizontal lines with an extra “.0000” added to the Y value, so they are more comparable in terms of communication bandwidth, and diagonal lines in “efficient” mode, which omits the Z coordinate and feedrate when possible to reduce the communication bandwidth.

The results for the first five lines are:
horizontal “extra zeroes”
0:04 - 0:29 → 25s
0:31 - 0:47 → 16s
0:49 - 0:59 → 10s
1:01 - 1:10 → 9s
1:11 - 1:18 → 7s

diagonal “efficient”
3:08 - 3:32 → 24s
3:33 - 3:48 → 15s
3:50 - 4:00 → 10s
4:02 - 4:10 → 8s
4:12 - 4:18 → 6s

Given that the start and end times are rounded to the nearest whole second as best I can, and the differences are only a few whole seconds, this is not a very precise measurement. Perhaps a bit better is the total time for the first three lines together, from the start of the first line to the end of the third:

Test 1: horizontal (no decimal on y position): 53s
Test 2: diagonal: 56s
Test 3: horizontal (extra .0000 added to y position): 55s
Test 4: diagonal “efficient” with no redundant Z or F: 52s
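Because these totals run from the first start timestamp to the third end timestamp, the gaps between lines are included (for test 1, the three individual durations alone sum to only 49 s). A small sketch for turning the video timestamps into elapsed seconds, with illustrative names:

```c
#include <stdio.h>

/* Convert an "m:ss" video timestamp to seconds; -1 if it can't be parsed. */
int mmss_to_seconds(const char *ts)
{
    int m, s;
    if (sscanf(ts, "%d:%d", &m, &s) != 2 || m < 0 || s < 0 || s > 59)
        return -1;
    return m * 60 + s;
}

/* Seconds from `start` to `end`, or -1 if either timestamp is invalid. */
int elapsed_seconds(const char *start, const char *end)
{
    int a = mmss_to_seconds(start);
    int b = mmss_to_seconds(end);
    if (a < 0 || b < 0)
        return -1;
    return b - a;
}
```

elapsed_seconds("0:04", "0:57") gives 53 and elapsed_seconds("3:13", "4:09") gives 56, matching tests 1 and 2.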

Test 2 and test 3 send almost the same number of characters and show only a small difference (56 s vs 55 s), whereas test 3 and test 1 should be identical after parsing, so the difference between them (55 s vs 53 s) should come entirely from handling the text, which includes the USB communication and parsing.

I can’t say this is conclusive, but to me it indicates that the CPU calculation of the diagonal steps is not a significant factor.