
rclone using too much memory #2157

Closed

Description

calisro (Member)

rclone v1.39-211-g572ee5ecβ

  • os/arch: linux/amd64
  • go version: go1.10

I recently upgraded from rclone v1.38-235-g2a01fa9fβ to rclone v1.39-211-g572ee5ecβ.

I'm using the following mount:
export RCLONE_CONFIG="/home/robert/.rclone.conf"
export RCLONE_BUFFER_SIZE=0M
export RCLONE_RETRIES=5
export RCLONE_STATS=0
export RCLONE_TIMEOUT=10m
export RCLONE_LOG_LEVEL=INFO
export RCLONE_DRIVE_USE_TRASH=false
/usr/sbin/rclone -vv --log-file /data/log/rmount-gs.log \
mount robgs-cryptp:Media $GS_RCLONE \
--allow-other \
--default-permissions \
--gid $gid --uid $uid \
--max-read-ahead 1024k \
--buffer-size 50M \
--dir-cache-time=72h \
--umask $UMASK 2>&1 > /data/log/debug &

This worked flawlessly before. In the most recent version, it is using WAY too much memory, causing OOM issues on my Linux server. The above uses upwards of 7 GB of resident memory before my OOM killer terminates the mount while activity is being read from or written to the mount. I used to be able to crank the buffer-size up to 150M without issues. There is NOTHING else different between the setups. I can restore my old version and re-run the batch process without issues, then replace it with the new version, and it consistently terminates due to OOM after consuming everything left.

If I restore the old version and rerun the exact same process, rclone consistently uses no more than 1.8G.

I can provide logs but there is nothing in them; they simply show transfers. Even in the fuse debug there is nothing but normal activity. There seems to be either a leak, or the memory usage has changed DRASTICALLY between the versions, making things unusable. I've rolled back until this is sorted.
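
For what it's worth, a minimal sketch of how the growth can be tracked over time (this assumes a single rclone process and only uses standard Linux tools; the interval and log format are arbitrary):

#!/bin/bash
# Sample the resident memory (VmRSS) of the rclone mount every 30s so the
# growth can be correlated with read/write activity on the mount.
while true; do
    pid=$(pidof -s rclone)
    if [ -n "$pid" ]; then
        echo "$(date '+%F %T') rss=$(awk '/VmRSS/ {print $2, $3}' /proc/"$pid"/status)"
    fi
    sleep 30
done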

Activity

remusb (Collaborator) commented on Mar 20, 2018

Are you using cache? If yes, and you're also using Plex, can you disable the Plex integration and see if it does the same thing?

It will be enough to just remove those settings from the config section and start again.

LE: Even easier would be to simply run with -v.

calisro (Member, Author) commented on Mar 20, 2018

No, I'm not using cache, and it normally runs with -v. The older version didn't use cache either, as I don't think that was implemented yet. The new one is using the same remotes and setup as the old; the only change is the executable upgrade.

remusb (Collaborator) commented on Mar 20, 2018

Hmm, ok. There was another report in the forum and that person uses cache. I was trying to establish a correlation, but it seems to be coming from somewhere else.

ncw (Member) commented on Mar 20, 2018

I'd like to replicate this if possible. Can you describe the activity that causes the problem in more detail?

Or maybe you've got a script I could run?

Which provider are you using? And are you using crypt too?

added this to the v1.41 milestone on Mar 20, 2018

calisro (Member, Author) commented on Mar 20, 2018

So the batch script is actually just a set of rclone commands to sync stuff up against the mount points.

SETUP:
I have a local mount point called /data/Media1
I have the rclone remote mounted on /data/Media2

This is my mount:

export RCLONE_CONFIG="/home/robert/.rclone.conf"
export RCLONE_BUFFER_SIZE=0M
export RCLONE_RETRIES=5
export RCLONE_STATS=0
export RCLONE_TIMEOUT=10m
export RCLONE_LOG_LEVEL=INFO
export RCLONE_DRIVE_USE_TRASH=false
/usr/sbin/rclone -vv --log-file /data/log/rmount-gs.log \
mount robgs-cryptp:Media /data/Media2 \
--allow-other \
--default-permissions \
--gid $gid --uid $uid \
--max-read-ahead 1024k \
--buffer-size 50M \
--dir-cache-time=72h \
--umask $UMASK 2>&1 > /data/log/debug &

BATCH:
The batch script is simply copying new data from one to the other using this command, among others. This is where most of the activity starts and where the OOM starts to happen:

export RCLONE_STATS=5m
export RCLONE_CONFIG=/home/robert/.rclone.conf
export RCLONE_ONE_FILE_SYSTEM=true
export RCLONE_RETRIES=10
export RCLONE_BWLIMIT="07:00,5M 23:00,10M"
export RCLONE_TRANSFERS=8
export RCLONE_CHECKERS=16
export RCLONE_SKIP_LINKS=true
export RCLONE_DRIVE_USE_TRASH=false
rclone \
   -v \
   --modify-window 1ms \
   --exclude=**nfo \
   copy /data/Media1 /data/Media2

I'm using Google as the provider and it is using crypt as well. Media2 is simply synced with Media1, but I don't delete, so it becomes a superset of what is copied to Media1. New files get put in Media1 and are then replicated via this process.

danielloader (Contributor) commented on Mar 20, 2018

I am experiencing this too; I only noticed because my IOWAIT shot up as I hit swap and rclone pages were being pushed to disk.
As for replication, I can't seem to reproduce it other than by mounting the remote with crypt/cache and playing media on Plex; after about 5 minutes it thrashes the disk as memory shoots up from 200MB to the full 2GB on the VPS.

This wasn't the behaviour in the later betas of 1.39; has anything changed in that respect?

ncw (Member) commented on Mar 20, 2018

I managed to replicate this...

First I made a directory with 1000 files in it using

make_test_files -max-size 1000000 test-files

(this is one of the rclone tools)

I then mounted up

rclone mount /tmp/test1 /mnt/tmp/

and ran this little script to copy the files in and out of the mount

#!/bin/bash

for i in `seq 100`; do
    echo "round $i"
    rclone -v move /tmp/test-files/ /mnt/tmp/test-files
    rclone -v move /mnt/tmp/test-files /tmp/test-files/
    echo "waiting for 10s... (round $i)"
    sleep 10s
done

v1.39 uses a pretty constant 4MB, v1.40 goes up and up!

danielloader (Contributor) commented on Mar 20, 2018

Graph from when I replicated your steps: https://i.imgur.com/Lvm01QB.png
Looks like a memory leak alright. Once RAM is exhausted, it's just stalled at 100% IOWAIT as it tries to use swap.

ncw (Member) commented on Mar 20, 2018

OK here is the pprof memory usage - rclone was at about 2GB RSS when I took this.

So it looks like it might be a bug in the fuse library... though I'm not 100% sure about that.

(pprof heap graph image: profile001)

(pprof) list allocMessage
Total: 1.83GB
ROUTINE ======================== github.com/ncw/rclone/vendor/bazil.org/fuse.allocMessage in /home/ncw/go/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go
    1.76GB     1.76GB (flat, cum) 96.03% of Total
         .          .    419:var reqPool = sync.Pool{
         .          .    420:	New: allocMessage,
         .          .    421:}
         .          .    422:
         .          .    423:func allocMessage() interface{} {
    1.76GB     1.76GB    424:	m := &message{buf: make([]byte, bufSize)}
         .          .    425:	m.hdr = (*inHeader)(unsafe.Pointer(&m.buf[0]))
         .          .    426:	return m
         .          .    427:}
         .          .    428:
         .          .    429:func getMessage(c *Conn) *message {
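
For reference, a sketch of how a profile like this can be captured and inspected. It assumes rclone's --memprofile flag (used later in this thread), which, as far as I know, writes the heap profile when rclone exits; the paths and remote name are placeholders:

# start the mount with memory profiling enabled
rclone mount remote: /mnt/point --memprofile /tmp/mem.prof
# exercise the mount, then unmount cleanly so the profile is written
go tool pprof /tmp/mem.prof
# inside pprof: 'top' shows the biggest allocators and
# 'list allocMessage' produces the annotated source shown above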

remusb (Collaborator) commented on Mar 20, 2018

Was the fuse library updated just before the release? It's a bit weird that no one noticed this in the last betas.

A quick listing of the vendor directory shows that it wasn't.

calisro (Member, Author) commented on Mar 20, 2018

I have this issue on 1.39 though.

calisro (Member, Author) commented on Mar 20, 2018

rclone v1.39-211-g572ee5ecβ

ncw (Member) commented on Mar 20, 2018

I used the magic of git bisect to narrow it down to this commit fc32fee

commit fc32fee4add93e7a3bda180c7c1088f93c759d13
Author: Nick Craig-Wood <nick@craig-wood.com>
Date:   Fri Mar 2 16:39:42 2018 +0000

    mount, cmount: add --attr-timeout to control attribute caching in kernel
    
    This flag allows the attribute caching in the kernel to be controlled.
    The default is 0s - no caching - which is recommended for filesystems
    which can change outside the control of the kernel.
    
    Previously this was at the default meaning it was 60s for mount and 1s
    for cmount.  This showed strange effects when files changed on the
    remote not via the kernel.  For instance Caddy would serve corrupted
    files for a while when serving from an rclone mount when a file
    changed on the remote.

Which is both good and bad.

Good, because there is a workaround (--attr-timeout 1s), but bad because this is the second bug I've bisected to the same commit today.

I tried --attr-timeout 1ms and that still leaked a bit of memory, whereas --attr-timeout 1s seems OK.

I suspect this is a bug in the fuse library, as rclone cmount (which is based on libfuse) doesn't show the same behaviour; it works fine with --attr-timeout 0.

I'll report a bug upstream in a moment...
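
Until the fix lands, the workaround can be applied to the mount from the original report roughly like this (only --attr-timeout 1s is new; the remote, paths and other flags are taken from the report above):

/usr/sbin/rclone -v mount robgs-cryptp:Media /data/Media2 \
--attr-timeout 1s \
--allow-other \
--buffer-size 50M \
--dir-cache-time=72h &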

[22 remaining items not shown]

seuffert (Contributor) commented on Mar 22, 2018

> I assumed rclone would buffer the 32MB + a bit of overhead maybe and that would be it.

--buffer-size is per open file on a mount.
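
For illustration only (the file count is assumed, not measured): a scan touching 20 files concurrently with --buffer-size 32M can hold roughly 20 × 32 MB ≈ 640 MB in read buffers alone, before any other allocations.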

neik1 commented on Mar 22, 2018

OK! But how is it possible that there were so many open files that it led to this issue when there was only one stream going? That's the point I don't understand yet.

danielloader (Contributor) commented on Mar 22, 2018

Well, if there's a library scan and it's not doing it sequentially, it might be doing mediainfo lookups on multiple files at once. You could try turning the buffer size down to 8MB and see if it retains enough performance (see the example below)? Alternatively, as said, write the 32MB buffers to disk as a swap of sorts.

I know on Plex you can opt out of chapter/thumbnail creation, which would otherwise read the whole file on scans; can Emby opt out too?
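
A sketch of that suggestion (the remote name and mount point are placeholders; 8M is just the value floated above):

rclone mount remote: /mnt/point --buffer-size 8M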

neik1 commented on Mar 22, 2018

With this adapted mount command it crashed again:

rclone_1.40 mount --log-level DEBUG --log-file /tmp/log.txt --allow-non-empty --allow-other --dir-cache-time 8760h --poll-interval 24h --vfs-cache-mode writes --attr-timeout 60s --memprofile /tmp/mem.prof --vfs-cache-max-age 0m upload: /media/cry
(The difference from the first one is that --attr-timeout was set to 60s, --memprofile was added, and --buffer-size 32M was removed.)

Log -> https://1drv.ms/t/s!AoPn9ceb766mgYsqBGX5KjSKxFVeiA
During the crash there was only one stream ongoing and nothing else.
Unfortunately, the memusage script apparently stopped when the SSH connection disconnected. How can I start it so that it stays alive (see the sketch after this comment)? I used ./memusage.sh > /tmp/mem.log &

My next step will be to try avoiding the memory cache entirely with --cache-chunk-no-memory.
Edit: Is this an rclone cache-specific flag? I am not using cache!

@Daniel-Loader, I am not using chapter/thumbnail creation. The library scan for new files takes about 8 minutes (scraping of new files included), so it's actually pretty fast I suppose.
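
On the memusage.sh question above: a minimal sketch, assuming the script simply loops and samples rclone's memory. Detaching it from the terminal keeps it alive after the SSH session drops:

# nohup ignores the hangup signal sent when the SSH session closes
nohup ./memusage.sh > /tmp/mem.log 2>&1 &
# alternatively, run it inside a screen or tmux session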

danielloader (Contributor) commented on Mar 22, 2018

Yeah, that's incredibly fast for a non-cached remote media scan, depending on library size!

hklcf commented on Mar 23, 2018

neik1 commented on Mar 23, 2018

Yeah, well... In the end it doesn't seem to be an rclone issue. In my case it seems to be an Emby problem with a specific client.
Gonna try to narrow it down with the developers over there and will come back to report. ;-)

Edit: I am saying this because Emby crashed again, but this time rclone didn't (probably because of the --cache-chunk-no-memory flag) and was only using 80MB of memory.

A commit referencing this issue was added on Mar 23, 2018: 98a9246

ncw (Member) commented on Mar 24, 2018

I've committed a fix to change the default of --attr-timeout to 1s and added notes to the docs about what happens when you set it to 0s.

Hopefully upstream will come up with a fix eventually, but this will do for the moment.

ncw (Member) commented on Mar 24, 2018

This will be in

https://beta.rclone.org/v1.40-018-g98a92460/ (uploaded in 15-30 mins)
